Overview and background
Groups
Please get into groups of 2-3. We don’t have enough computational power for all of you working alone. Please let the instructors know if you need help finding a group.
Assignment notes
While some questions might seem hard we naturally don’t ask questions/tasks which you have not been given the tools to solve in this assignment - so if you are stuck try thinking about what you have already learned before asking an instructor.
Assignment overview
In this assignment you are going to analyze RNA-sequencing data from real cancer patients to analyze the importance of alternative splicing in a clinical context
Biological background
Today you will be working with colorectal cancers - specifically Colon Adenocarcinoma (often abbreviated COAD). It is a cancer of the colon that is very frequent. The lifetime risk of developing colorectal cancer is ~4% for both males and females. That means COAD represents ~10% of all cancers and results in the death of hundreds of thousands of people each year! (More info on COAD can be found on Wikipedia.
One important aspect of cancer is that tumors from different patients are extremely different even when they originate from the same tissue (more info on tumor heterogeneity here). To improve treatment and prognosis we therefore try to classify COAD into cancer subtypes (a simple form of precision medicine). We currently think there are 5 subtypes (see Liu et al.) and today you will be working with CIN and GS. CIN is an abbreviation for Chromosomal INstable and GS means genome stable. More on that later.
To help us understand COAD subtypes you will today compare these to healthy adjacent tissue. For all samples a biopsy was taken and bulk RNA-seq performed. Low-quality samples have been removed.
Bioinformatic background
For background on transcriptomics and splicing please refer to today’s slides. The data you are working with is a randomly selected a subset of the TCGA COAD data (google TCGA if you want to know more). The data was quantified with Kallisto against the human transcriptome.
Today you will be using the 'pairedGSEA' R package we developed. This package is specifically designed to make it easy to do the following analysis:
- Differential gene expression (aka DGE) via DESeq(2)
- Differential gene usage (differential splicing) (aka DGU)
- gene-set over-representation analysis (ORA) on DGU and DGE results
While at each step facilitating easy comparison of DGE and DGU.
Assignment
Step 1: Determine which cancer to work with
Determine which cancer type you will work with:
- If your birthday is within the first 6 months of the year (January-June) you will work with CIN.
- If your birthday is within the last 6 months of the year (July-December) you will work with GS.
Step 2: Set up enviroment
Log into the server as you usually do except this time you have to use the '-X' option. That means using:
ssh -X username@pupil1.healthtech.dtu.dk.
Make a directory for this exercise and move into it
mkdir transcriptomics_exercise cd transcriptomics_exercise
Copy the exercise data of your cancer subtype to your folder
### for CIN subtype: cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_cin.Rdata . ### For GS subtype: cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_gs.Rdata .
Step 3: Start R session and enviroment
To start an R session in your terminal typing (or copy/pasting)
R-4.2.2
And load the library we need by typing
library(pairedGSEA)
This loads the functionality of the “pairedGSEA” R package.
Step 4: Load and inspect data
Load the assignment data into your R session:
### for CIN subtype: load('coad_iso_subset_cin.Rdata') ### For GS subtype: load('coad_iso_subset_gs.Rdata')
This will give you two data objects in your R session:
- A count matrix
- A matrix with meta information about each sample in the count matrix.
- A list of gene_sets that you should use for your ORA analysis (step 7).
All objects can be directly used by the 'pairedGSEA' package - no need to do any data modifications.
Use the following functions to take a look at the data:
### List objects in an R session ls() ### Inspect the first lines of the object head( <object_name> )
Question: Which object contains what data?
Step 5: Run differential analysis
Next you will need to use the 'pairedGSEA' package and here a bit of self-study is needed. Importantly you should only run this analysis once per group - else we don’t have enough computational power. You can download the 'pairedGSEA' vignette (short document showing how to use it) <a href="https://www.dropbox.com/s/oalth29pxulffec/pairedGSEA.html?dl=1">here</a>.
Hints:
- After reading the introduction you can skip to the '3.3 Running the analysis' section.
- For now you only need to use 'paired_diff()' as that makes both differential analyses (both DGE and DGU).
- There is no need to use the “store_results” option
Question: This will take a while to run (~10 min). In the mean time take a closer look at the Liu et al. paper (see above) and summarise what the difference between the CIN and GS COAD subtypes are.
Step 6: Inspect diffrential result
Question: Look at the first 10 lines of the result file. Which gene is most significant (smallest p-value) for the DGE and DGU analysis (respectively DESeq2 and DEXSeq)
The following code example counts how many significantly differentially expressed genes are found:
sum( gi_diff_results$padj_deseq < 0.05, na.rm = T )
Question: Modify the R code above to count how many genes are DGE and DGU.
Question: Use the 'nrow()' function to calculate the fraction of genes that are DGE and DGU.
Now we are ready to do the gene-set enrichment analysis.
Step 7: Run Gene-Set Enrichment Analysis
Use the vignette to help you use 'pairedGSEA' to run GSEA on both DGE and DGU results (see the vignette section 4: “Over-Representation Analysis”). You should use the 'gene_set_list' object you have already loaded into R instead of using the 'prepare_msigdb()' function.
Note: There is (again) no need to store the intermediary results.
Step 8: Inspect ORA result
What you have been analyzing so far is a subset of the entire dataset (since the runtime else would have been 3-4x longer). To enable a more realistic last step use one of these commands to load the full results corresponding to what you have been working with.
### for CIN subtype: load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_cin_ora.Rdata') # loads the "cin_ora" object ### For GS subtype: load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_gs_ora.Rdata') # loads the gs_ora object
The following code example extract the ORA analysis of either DGU and DGE and sorts it so the most significant gene-sets are at the top.
### DGE: dge_ora_sorted <- gi_paired_ora[ sort.list(gi_paired_ora$pval_deseq), # sort part c('pathway','pval_deseq','enrichment_score_deseq') # select part ] ### DGU ORA: dgu_ora_sorted <- gi_paired_ora[ sort.list(gi_paired_ora$pval_dexseq), # sort part c('pathway','pval_dexseq','enrichment_score_dexseq') # select part ]
Question: Look at the 10-15 most significant gene sets from both analyses. What are the similarities and differences?
Step 9: Visual inspection of ORA result
Question: Based on your insights from step 8 use the 'plot_ora()' functionality to test if these are just examples or generalize to all the significant results. An example: If I from the 10-15 top gene-sets saw that only DGU had gene-sets covering “telomer” function I would use the 'plot_ora()' function to test this.
Question: Try to make a hypothesis as to why this/these molecular functions might be important for cancer.
Step 10: Critical self evaluation
Question: Take a moment to think about what potential problems there could be with this assignment. Are there any obvious things we have not taken into consideration?
Step 11: Report result
Go to the blackboard and report one or more of the following:
- A keyword that showed a similar enrichment pattern in DGU and DGE
- A keyword that showed preferential regulation through DGU or DGE
Bonus Assignment
Use 'pairedGSEA' to analyze the other COAD cancer subtype (the one you did not analyze). Are the gene-sets similar or different between the subtypes and analysis types?