Rnaseq exercise

From 22126
Jump to navigation Jump to search

Overview and background

Groups

Please get into groups of 2-3. We don’t have enough computational power for all of you working alone. Please let the instructors know if you need help finding a group.

Assignment notes

While some questions might seem hard we naturally don’t ask questions/tasks which you have not been given the tools to solve in this assignment - so if you are stuck try thinking about what you have already learned before asking an instructor.

Assignment overview

In this assignment you are going to analyze RNA-sequencing data from real cancer patients to analyze the importance of alternative splicing in a clinical context

Biological background

Today you will be working with colorectal cancers - specifically Colon Adenocarcinoma (often abbreviated COAD). It is a cancer of the colon that is very frequent. The lifetime risk of developing colorectal cancer is ~4% for both males and females. That means COAD represents ~10% of all cancers and results in the death of hundreds of thousands of people each year! (More info on COAD can be found on Wikipedia.

One important aspect of cancer is that tumors from different patients are extremely different even when they originate from the same tissue (more info on tumor heterogeneity here). To improve treatment and prognosis we therefore try to classify COAD into cancer subtypes (a simple form of precision medicine). We currently think there are 5 subtypes (see Liu et al.) and today you will be working with CIN and GS. CIN is an abbreviation for Chromosomal INstable and GS means genome stable. More on that later.

To help us understand COAD subtypes you will today compare these to healthy adjacent tissue. For all samples a biopsy was taken and bulk RNA-seq performed. Low-quality samples have been removed.

Bioinformatic background

For background on transcriptomics and splicing please refer to today’s slides. The data you are working with is a randomly selected a subset of the TCGA COAD data (google TCGA if you want to know more). The data was quantified with Kallisto against the human transcriptome.

Today you will be using the 'pairedGSEA' R package we developed. This package is specifically designed to make it easy to do the following analysis:

  1. Differential gene expression (aka DGE) via DESeq(2)
  2. Differential gene usage (differential splicing) (aka DGU)
  3. gene-set over-representation analysis (ORA) on DGU and DGE results

While at each step facilitating easy comparison of DGE and DGU.


Assignment

Step 1: Determine which cancer to work with

Determine which cancer type you will work with:

  • If your birthday is within the first 6 months of the year (January-June) you will work with CIN.
  • If your birthday is within the last 6 months of the year (July-December) you will work with GS.

Step 2: Set up enviroment

Log into the server as you usually do except this time you have to use the '-X' option. That means using:

ssh -X username@pupil1.healthtech.dtu.dk
.

Make a directory for this exercise and move into it

mkdir transcriptomics_exercise
cd transcriptomics_exercise

Copy the exercise data of your cancer subtype to your folder

### for CIN subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_cin.Rdata .

### For GS subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_gs.Rdata .

Step 3: Start R session and enviroment

To start an R session in your terminal typing (or copy/pasting)

R-4.2.2

And load the library we need by typing

library(pairedGSEA)

This loads the functionality of the “pairedGSEA” R package.

Step 4: Load and inspect data

Load the assignment data into your R session:

### for CIN subtype:
load('coad_iso_subset_cin.Rdata')

### For GS subtype:
load('coad_iso_subset_gs.Rdata')

This will give you two data objects in your R session:

  1. A count matrix
  2. A matrix with meta information about each sample in the count matrix.
  3. A list of gene_sets that you should use for your ORA analysis (step 7).

All objects can be directly used by the 'pairedGSEA' package - no need to do any data modifications.


Use the following functions to take a look at the data:

### List objects in an R session
ls()

### Inspect the first lines of the object
head( <object_name> )

Question: Which object contains what data?

Step 5: Run differential analysis

Next you will need to use the 'pairedGSEA' package and here a bit of self-study is needed. Importantly you should only run this analysis once per group - else we don’t have enough computational power. You can download the 'pairedGSEA' vignette (short document showing how to use it) <a href="https://www.dropbox.com/s/oalth29pxulffec/pairedGSEA.html?dl=1">here</a>.

Hints:

  1. After reading the introduction you can skip to the '3.3 Running the analysis' section.
  2. For now you only need to use 'paired_diff()' as that makes both differential analyses (both DGE and DGU).
  3. There is no need to use the “store_results” option

Question: This will take a while to run (~10 min). In the mean time take a closer look at the Liu et al. paper (see above) and summarise what the difference between the CIN and GS COAD subtypes are.

Step 6: Inspect diffrential result

Question: Look at the first 10 lines of the result file. Which gene is most significant (smallest p-value) for the DGE and DGU analysis (respectively DESeq2 and DEXSeq)

The following code example counts how many significantly differentially expressed genes are found:

sum( gi_diff_results$padj_deseq < 0.05, na.rm = T )

Question: Modify the R code above to count how many genes are DGE and DGU.

Question: Use the 'nrow()' function to calculate the fraction of genes that are DGE and DGU.

Now we are ready to do the gene-set enrichment analysis.

Step 7: Run Gene-Set Enrichment Analysis

Use the vignette to help you use 'pairedGSEA' to run GSEA on both DGE and DGU results (see the vignette section 4: “Over-Representation Analysis”). You should use the 'gene_set_list' object you have already loaded into R instead of using the 'prepare_msigdb()' function.

Note: There is (again) no need to store the intermediary results.

Step 8: Inspect ORA result

What you have been analyzing so far is a subset of the entire dataset (since the runtime else would have been 3-4x longer). To enable a more realistic last step use one of these commands to load the full results corresponding to what you have been working with.

### for CIN subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_cin_ora.Rdata')
# loads the "cin_ora" object

### For GS subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_gs_ora.Rdata')
# loads the gs_ora object

The following code example extract the ORA analysis of either DGU and DGE and sorts it so the most significant gene-sets are at the top.

### DGE:
dge_ora_sorted <- gi_paired_ora[
    sort.list(gi_paired_ora$pval_deseq),                 # sort part
    c('pathway','pval_deseq','enrichment_score_deseq')   # select part
]

### DGU ORA:
dgu_ora_sorted <- gi_paired_ora[
    sort.list(gi_paired_ora$pval_dexseq),                # sort part
    c('pathway','pval_dexseq','enrichment_score_dexseq') # select part
]

Question: Look at the 10-15 most significant gene sets from both analyses. What are the similarities and differences?

Step 9: Visual inspection of ORA result

Question: Based on your insights from step 8 use the 'plot_ora()' functionality to test if these are just examples or generalize to all the significant results. An example: If I from the 10-15 top gene-sets saw that only DGU had gene-sets covering “telomer” function I would use the 'plot_ora()' function to test this.

Question: Try to make a hypothesis as to why this/these molecular functions might be important for cancer.

Step 10: Critical self evaluation

Question: Take a moment to think about what potential problems there could be with this assignment. Are there any obvious things we have not taken into consideration?

Step 11: Report result

Go to the blackboard and report one or more of the following:

  • A keyword that showed a similar enrichment pattern in DGU and DGE
  • A keyword that showed preferential regulation through DGU or DGE

Bonus Assignment

Use 'pairedGSEA' to analyze the other COAD cancer subtype (the one you did not analyze). Are the gene-sets similar or different between the subtypes and analysis types?