WikiSysop: Created page with "

Overview and background

Groups

Please get into groups of 2-3. We don’t have enough computational power for all of you working alone. Please let the instructors know if you need help finding a group.

Assignment notes

While some question..."

2024-03-19T15:37:57Z


<div class="page-content has-page-title">
<div id="overview-and-background" class="section level1">
<h1>Overview and background</h1>
<div id="groups" class="section level2">
<h2>Groups</h2>
<p>Please get into groups of 2-3. We don’t have enough computational power for all of you working alone. Please let the instructors know if you need help finding a group.</p>
</div>

<div id="assignment-notes" class="section level2">
<h2>Assignment notes</h2>
<p>While some questions might seem hard, we do not ask questions or set tasks that you have not been given the tools to solve in this assignment - so if you are stuck, try thinking about what you have already learned before asking an instructor.</p>
</div>

<div id="assignment-overview" class="section level2">
<h2>Assignment overview</h2>
<p>In this assignment you are going to analyze RNA-sequencing data from real cancer patients to investigate the importance of alternative splicing in a clinical context.</p>
</div>

<div id="biological-background" class="section level2">
<h2>Biological background</h2>
<p>Today you will be working with colorectal cancers - specifically Colon Adenocarcinoma (often abbreviated COAD). It is a very frequent cancer of the colon. The lifetime risk of developing colorectal cancer is ~4% for both males and females. COAD represents ~10% of all cancers and results in the death of hundreds of thousands of people each year! (More info on COAD can be found on [https://en.wikipedia.org/wiki/Colorectal_cancer Wikipedia].)</p>

<p>One important aspect of cancer is that tumors from different patients are extremely different even when they originate from the same tissue (more info on tumor heterogeneity [https://en.wikipedia.org/wiki/Tumour_heterogeneity here]). To improve treatment and prognosis we therefore try to classify COAD into cancer subtypes (a simple form of precision medicine). We currently think there are 5 subtypes (see [https://www.cell.com/cancer-cell/pdf/S1535-6108(18)30114-4.pdf Liu ''et al.'']) and today you will be working with CIN and GS. CIN is an abbreviation for Chromosomal INstability and GS means Genome Stable. More on that later.</p>

<p>To help us understand COAD subtypes you will today compare these to healthy adjacent tissue. For all samples a biopsy was taken and bulk RNA-seq performed. Low-quality samples have been removed.</p>

</div>
<div id="bioinformatic-background" class="section level2">
<h2>Bioinformatic background</h2>
<p>For background on transcriptomics and splicing please refer to today’s slides. The data you are working with is a randomly selected subset of the TCGA COAD data (google TCGA if you want to know more). The data was quantified with Kallisto against the human transcriptome.</p>

<p>Today you will be using the 'pairedGSEA' R package we developed. This package is specifically designed to make it easy to do the following analysis:</p>

<ol style="list-style-type: decimal">
<li>Differential gene expression (aka DGE) via DESeq2</li>
<li>Differential gene usage (differential splicing) (aka DGU) via DEXSeq</li>
<li>Gene-set over-representation analysis (ORA) on DGU and DGE results</li>
</ol>
<p>While at each step facilitating easy comparison of DGE and DGU.</p>
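<p>As a rough illustration of what an over-representation analysis tests, the sketch below runs a hypergeometric test in base R on made-up numbers. It is not how 'pairedGSEA' is implemented internally; all numbers and variable names are hypothetical.</p>
<pre>
# Toy numbers (hypothetical, for illustration only)
n_universe <- 10000  # genes tested in total
n_sig      <- 500    # genes called significant
n_in_set   <- 100    # genes in the gene set of interest
n_overlap  <- 15     # significant genes that are also in the gene set

# Expected overlap if the significant genes were drawn at random
expected_overlap <- n_sig * n_in_set / n_universe

# P(overlap >= n_overlap) under the hypergeometric distribution
p_value <- phyper(
    q = n_overlap - 1,
    m = n_in_set,
    n = n_universe - n_in_set,
    k = n_sig,
    lower.tail = FALSE
)

expected_overlap
p_value
</pre>
<p>A gene set is over-represented when the observed overlap is clearly larger than the expected one, which is what a small p-value indicates.</p>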
<hr />
</div>
</div>
<div id="assignment" class="section level1">
<h1>Assignment</h1>
<div id="step-1-determine-which-cancer-to-work-with" class="section level2">
<h2>Step 1: Determine which cancer to work with</h2>
<p>Determine which cancer type you will work with:</p>
<ul>
<li>If your birthday is within the first 6 months of the year (January-June) you will work with <strong>CIN</strong>.</li>
<li>If your birthday is within the last 6 months of the year (July-December) you will work with <strong>GS</strong>.</li>
</ul>
</div>
<div id="step-2-set-up-enviroment" class="section level2">
<h2>Step 2: Set up environment</h2>
<p>Log into the server as you usually do, except this time you have to use the '-X' option. That means using:</p>
<pre>
ssh -X username@pupil1.healthtech.dtu.dk
</pre>

<p>Make a directory for this exercise and move into it</p>
<pre>
mkdir transcriptomics_exercise
cd transcriptomics_exercise
</pre>

<p>Copy the exercise data of your cancer subtype to your folder</p>
<pre>
### for CIN subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_cin.Rdata .

### For GS subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_gs.Rdata .
</pre>

</div>
<div id="step-3-start-r-session-and-enviroment" class="section level2">
<h2>Step 3: Start R session and environment</h2>
<p>Start an R session in your terminal by typing (or copy/pasting)</p>
<pre>
R-4.2.2
</pre>
<p>And load the library we need by typing</p>
<pre>
library(pairedGSEA)
</pre>

<p>This loads the functionality of the “pairedGSEA” R package.</p>
</div>
<div id="step-4-load-and-inspect-data" class="section level2">
<h2>Step 4: Load and inspect data</h2>
<p>Load the assignment data into your R session:</p>
<pre>
### for CIN subtype:
load('coad_iso_subset_cin.Rdata')

### For GS subtype:
load('coad_iso_subset_gs.Rdata')
</pre>
<p>This will give you three data objects in your R session:</p>
<ol style="list-style-type: decimal">
<li>A count matrix</li>
<li>A matrix with meta information about each sample in the count matrix.</li>
<li>A list of gene_sets that you should use for your ORA analysis (step 7).</li>
</ol>

<p>All objects can be directly used by the 'pairedGSEA'
package - no need to do any data modifications.</p>
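<p>If you are unsure what a '.Rdata' file added to your session, note that 'load()' invisibly returns the names of the objects it restored. A minimal sketch using a temporary file with made-up objects (the names 'toy_counts' and 'toy_meta' are hypothetical and not part of the exercise data):</p>
<pre>
toy_counts <- matrix(1:6, nrow = 3)
toy_meta   <- data.frame(sample_id = c("s1", "s2"))

# Write the toy objects to a temporary .Rdata file
tmp_file <- tempfile(fileext = ".Rdata")
save(toy_counts, toy_meta, file = tmp_file)

# Load into a separate environment so the current session is untouched
tmp_env <- new.env()
loaded_names <- load(tmp_file, envir = tmp_env)
loaded_names
</pre>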
<p><br></p>
<p>Use the following functions to take a look at the data:</p>
<pre>
### List objects in an R session
ls()

### Inspect the first lines of the object
head( &lt;object_name&gt; )
</pre>
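<p>Beyond 'head()', a few base R functions give a quick overview of size and group composition. The sketch below uses small made-up objects; 'toy_counts' and 'toy_meta' are hypothetical stand-ins, and the metadata is assumed to be a data frame with the columns 'sample_id' and 'condition' (the column names used in the 'paired_diff()' call in step 5).</p>
<pre>
toy_counts <- matrix(
    c(10, 0, 5, 3, 8, 1, 7, 2),
    nrow = 2,
    dimnames = list(c("tx1", "tx2"), c("s1", "s2", "s3", "s4"))
)
toy_meta <- data.frame(
    sample_id = c("s1", "s2", "s3", "s4"),
    condition = c("Control", "Control", "Cancer", "Cancer")
)

dim(toy_counts)            # rows (features) x columns (samples)
class(toy_counts)          # what kind of object is it?
table(toy_meta$condition)  # number of samples per condition
</pre>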

<p><strong>Question</strong>: Which object contains what data?</p>
<p><strong>Answer</strong>:</p>
<ol style="list-style-type: decimal">
<li>cinCountsSubset : Count data</li>
<li>cinMeta : Condition info (ctrl vs cancer)</li>
<li>gene_set_list : List of gene-sets</li>
</ol>
</div>
<div id="step-5-run-differential-analysis" class="section level2">
<h2>Step 5: Run differential analysis</h2>
<p>Next you will need to use the 'pairedGSEA' package, and
here a bit of self-study is needed. <strong>Importantly</strong>, you
should only run this analysis once per group - otherwise we don’t have
enough computational power. You can download the
'pairedGSEA' vignette (a short document showing how to use it)
<a href="https://www.dropbox.com/s/oalth29pxulffec/pairedGSEA.html?dl=1">here</a>.</p>
<p>Hints:</p>
<ol style="list-style-type: decimal">
<li>After reading the introduction you can skip to the
'3.3 Running the analysis' section.</li>
<li>For now you only need to use 'paired_diff()' as that
makes both differential analyses (both DGE and DGU).</li>
<li>There is no need to use the “store_results” option</li>
</ol>
<p><strong>Question</strong>: This will take a while to run (~10 min).
In the meantime, take a closer look at the Liu <em>et al.</em> paper
(see above) and summarise what the differences between the CIN and GS
COAD subtypes are.</p>
<p><strong>Answer</strong>:</p>
<pre>
gi_diff_results <- paired_diff(
    object = cinCountsSubset,
    metadata = cinMeta,        # Use with count matrix or if you want to
                               # change it in the input object
    group_col = 'condition',
    sample_col = 'sample_id',
    baseline = 'Control',
    case = 'COAD_genome_instable',
    store_results = FALSE
)
</pre>
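<p>Before starting a long run it can be worth checking that the metadata describes exactly the samples in the count matrix. A minimal sketch on made-up data, assuming that the column names of the count matrix are the sample identifiers (an assumption about the data layout, not something stated in the exercise; all toy object names and group labels are hypothetical):</p>
<pre>
toy_counts <- matrix(
    1:8, nrow = 2,
    dimnames = list(c("tx1", "tx2"), c("s1", "s2", "s3", "s4"))
)
toy_meta <- data.frame(
    sample_id = c("s1", "s2", "s3", "s4"),
    condition = c("Control", "Control", "COAD", "COAD")
)

# Every sample must be present in both objects
ids_match <- setequal(colnames(toy_counts), toy_meta$sample_id)

# Both groups passed as baseline and case must exist in the metadata
groups_present <- all(c("Control", "COAD") %in% toy_meta$condition)

# Stops with an error if either check fails
stopifnot(ids_match, groups_present)
</pre>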
</div>
<div id="step-6-inspect-diffrential-result" class="section level2">
<h2>Step 6: Inspect differential result</h2>
<p><strong>Question</strong>: Look at the first 10 lines of the result
object. Which gene is most significant (smallest p-value) in the DGE and
DGU analyses (DESeq2 and DEXSeq, respectively)?</p>
<p><strong>Answer</strong>:</p>
<ul>
<li>DESeq2 (DGE): AAR2</li>
<li>DEXSeq (DGU): A1BG</li>
</ul>
<p><br></p>
<p>The following code <em>example</em> counts how many significantly
differentially expressed genes are found:</p>
<pre>
sum( gi_diff_results$padj_deseq < 0.05, na.rm = TRUE )
</pre>
<p><strong>Question</strong>: Modify the R code above to count how many
genes are DGE and DGU.</p>
<p><strong>Answer</strong></p>
<pre>
sum( gi_diff_results$padj_deseq < 0.05, na.rm = TRUE )
# 4860
sum( gi_diff_results$padj_dexseq < 0.05, na.rm = TRUE )
# 2117
</pre>
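<p>The same pattern extends to counting genes that are significant in both analyses. A sketch on a made-up data frame that mimics the two adjusted p-value columns used above (the values and the object name 'toy_results' are hypothetical):</p>
<pre>
toy_results <- data.frame(
    gene        = c("g1", "g2", "g3", "g4", "g5"),
    padj_deseq  = c(0.01, 0.20, 0.03, NA,   0.04),
    padj_dexseq = c(0.02, 0.01, 0.50, 0.03, NA)
)

# which() drops NA, so untested genes are never counted as significant
sig_dge <- which(toy_results$padj_deseq  < 0.05)
sig_dgu <- which(toy_results$padj_dexseq < 0.05)

length(sig_dge)                      # number of DGE genes
length(sig_dgu)                      # number of DGU genes
length(intersect(sig_dge, sig_dgu))  # genes that are both DGE and DGU
</pre>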

<p><strong>Question</strong>: Use the 'nrow()' function to
calculate the fraction of genes that are DGE and DGU.</p>
<p><strong>Answer</strong>:</p>
<pre>
sum( gi_diff_results$padj_deseq < 0.05, na.rm = TRUE ) / nrow(gi_diff_results)
# 0.66
sum( gi_diff_results$padj_dexseq < 0.05, na.rm = TRUE ) / nrow(gi_diff_results)
# 0.29
</pre>
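<p>Note that dividing by 'nrow()' keeps genes with a missing adjusted p-value in the denominator, whereas 'mean(..., na.rm = TRUE)' drops them. A small made-up vector shows the difference (the values are hypothetical):</p>
<pre>
toy_padj <- c(0.01, 0.20, 0.03, NA, 0.04)

# Fraction of all genes (a missing value counts as "not significant")
frac_of_all <- sum(toy_padj < 0.05, na.rm = TRUE) / length(toy_padj)

# Fraction of the genes that actually received an adjusted p-value
frac_of_tested <- mean(toy_padj < 0.05, na.rm = TRUE)

frac_of_all     # 3 of 5 genes
frac_of_tested  # 3 of 4 tested genes
</pre>
<p>Which denominator is appropriate depends on the question being asked; the answer above uses all rows.</p>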

<p>Now we are ready to do the gene-set enrichment analysis.</p>
</div>
<div id="step-7-run-gene-set-enrichment-analysis" class="section level2">
<h2>Step 7: Run Gene-Set Enrichment Analysis</h2>
<p>Use the vignette to help you use 'pairedGSEA' to run the over-representation analysis on both DGE and DGU results (see the vignette section 4: “Over-Representation Analysis”). You should use the 'gene_set_list' object you have already loaded into R instead of using the 'prepare_msigdb()' function.</p>

<p>Note: There is (again) no need to store the intermediary results.</p>
<p><strong>Answer</strong></p>
<pre>
gi_paired_ora <- paired_ora(
paired_diff_result = gi_diff_results,
gene_sets = gene_set_list,
experiment_title = NULL
)
</pre>
</div>
<div id="step-8-inspect-ora-result" class="section level2">
<h2>Step 8: Inspect ORA result</h2>
<p>What you have been analyzing so far is a subset of the entire dataset
(since the runtime would otherwise have been 3-4x longer). To enable a more
realistic last step, use <strong>one</strong> of these commands to load
the full results corresponding to what you have been working with.</p>
<pre>
### for CIN subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_cin_ora.Rdata')
# loads the "cin_ora" object

### For GS subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_gs_ora.Rdata')
# loads the "gs_ora" object
</pre>
<p>The following code <em>example</em> extracts the ORA results for
DGE and DGU, respectively, and sorts each so the most significant gene-sets are at
the top.</p>

<pre>
### DGE:
dge_ora_sorted <- gi_paired_ora[
sort.list(gi_paired_ora$pval_deseq), # sort part
c('pathway','pval_deseq','enrichment_score_deseq') # select part
]

### DGU ORA:
dgu_ora_sorted <- gi_paired_ora[
sort.list(gi_paired_ora$pval_dexseq), # sort part
c('pathway','pval_dexseq','enrichment_score_dexseq') # select part
]
</pre>
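<p>For readers unfamiliar with this indexing pattern: the row index sorts and the column index selects. A sketch on a made-up table ('toy_ora' and its values are hypothetical; 'order()' is used here and gives the same ordering as 'sort.list()' for these values):</p>
<pre>
toy_ora <- data.frame(
    pathway    = c("SET_A", "SET_B", "SET_C"),
    pval_deseq = c(0.20, 0.001, 0.03),
    enrichment_score_deseq = c(0.1, 0.6, 0.4)
)

toy_sorted <- toy_ora[
    order(toy_ora$pval_deseq),   # sort part: smallest p-value first
    c("pathway", "pval_deseq")   # select part: keep two columns
]

head(toy_sorted, 2)              # the two most significant gene sets
</pre>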

<p><strong>Question</strong>: Look at the 10-15 most significant gene
sets from both analyses. What are the similarities and differences?</p>

<p><strong>Answer</strong></p>
<pre>
### DGE:
dge_ora_sorted <- cin_ora[
sort.list(cin_ora$pval_deseq), # sort part
c('pathway','pval_deseq','enrichment_score_deseq') # select part
]

head(dge_ora_sorted, 15)
</pre>

<pre>
## pathway pval_deseq
## 3823 REACTOME_RRNA_PROCESSING 3.694775e-19
## 4433 GOBP_RIBONUCLEOPROTEIN_COMPLEX_BIOGENESIS 4.453501e-17
## 3879 GOBP_RIBOSOME_BIOGENESIS 5.320229e-16
## 3785 KEGG_RIBOSOME 2.192710e-14
## 1061 GOBP_MITOTIC_CELL_CYCLE_PROCESS 1.962214e-13
## 4700 HALLMARK_E2F_TARGETS 2.038376e-13
## 977 REACTOME_CELL_CYCLE 2.567524e-13
## 3759 REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 3.350223e-13
## 3828 REACTOME_SELENOAMINO_ACID_METABOLISM 4.766866e-13
## 4598 HALLMARK_G2M_CHECKPOINT 7.966641e-13
## 3923 REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 3.734833e-12
## 864 GOCC_NUCLEOLUS 6.125940e-12
## 747 REACTOME_INFECTIOUS_DISEASE 7.879724e-12
## 425 GOBP_RESPONSE_TO_ORGANIC_CYCLIC_COMPOUND 8.207220e-12
## 2449 GOCC_ANCHORING_JUNCTION 9.689453e-12
## enrichment_score_deseq
## 3823 0.6239502
## 4433 0.4724997
## 3879 0.5166409
## 3785 0.7224417
## 1061 0.3588459
## 4700 0.5451123
## 977 0.3721868
## 3759 0.6923103
## 3828 0.6601218
## 4598 0.5398054
## 3923 0.6205530
## 864 0.3070925
## 747 0.3249413
## 425 0.3294544
## 2449 0.3259827
</pre>

<ul>
<li>DGE: mainly gene sets related to RIBOSOME and CELL_CYCLE</li>
</ul>
<pre class="r">
### DGU ORA:
dgu_ora_sorted <- cin_ora[
sort.list(cin_ora$pval_dexseq), # sort part
c('pathway','pval_dexseq','enrichment_score_dexseq') # select part
]
head(dgu_ora_sorted, 15)
</pre>
<pre>
## pathway
## 2757 GOBP_ACTIN_FILAMENT_BASED_PROCESS
## 2449 GOCC_ANCHORING_JUNCTION
## 2787 REACTOME_SIGNALING_BY_RHO_GTPASES_MIRO_GTPASES_AND_RHOBTB3
## 3180 GOMF_NUCLEOSIDE_TRIPHOSPHATASE_REGULATOR_ACTIVITY
## 2259 GOMF_ENZYME_REGULATOR_ACTIVITY
## 2345 GOMF_CYTOSKELETAL_PROTEIN_BINDING
## 2682 GOMF_TRANSFERASE_ACTIVITY_TRANSFERRING_PHOSPHORUS_CONTAINING_GROUPS
## 3363 GOBP_REGULATION_OF_SMALL_GTPASE_MEDIATED_SIGNAL_TRANSDUCTION
## 2045 GOCC_SUPRAMOLECULAR_COMPLEX
## 2806 GOMF_PROTEIN_DOMAIN_SPECIFIC_BINDING
## 2869 GOBP_SMALL_GTPASE_MEDIATED_SIGNAL_TRANSDUCTION
## 1781 GOBP_POSITIVE_REGULATION_OF_CATALYTIC_ACTIVITY
## 3047 WP_VEGFAVEGFR2_SIGNALING_PATHWAY
## 2377 GOBP_ORGANOPHOSPHATE_METABOLIC_PROCESS
## 2072 GOBP_CELL_MORPHOGENESIS
## pval_dexseq enrichment_score_dexseq
## 2757 3.504255e-16 0.7528919
## 2449 7.263291e-15 0.7065107
## 2787 1.728464e-14 0.7489251
## 3180 1.837683e-14 0.8509545
## 2259 2.393632e-14 0.6135585
## 2345 4.224209e-14 0.6584802
## 2682 1.400344e-13 0.6628826
## 3363 3.661677e-13 0.9806811
## 2045 4.692953e-13 0.5934650
## 2806 2.857812e-12 0.7084235
## 2869 3.593536e-12 0.7767232
## 1781 5.678506e-12 0.5736414
## 3047 5.845276e-12 0.8127997
## 2377 6.168410e-12 0.6300335
## 2072 1.251178e-11 0.6002377
</pre>
<ul>
<li>DGU: mainly gene sets related to ACTIN, JUNCTION and SIGNALING</li>
</ul>
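<p>One way to make the comparison systematic is to intersect the top gene sets from both analyses. The sketch below uses a handful of pathway names copied from the two outputs above rather than the full top-15 lists, so the vectors are illustrative only:</p>
<pre>
top_dge <- c(
    "REACTOME_RRNA_PROCESSING",
    "KEGG_RIBOSOME",
    "REACTOME_CELL_CYCLE",
    "GOCC_ANCHORING_JUNCTION"
)
top_dgu <- c(
    "GOBP_ACTIN_FILAMENT_BASED_PROCESS",
    "GOCC_ANCHORING_JUNCTION",
    "GOMF_CYTOSKELETAL_PROTEIN_BINDING"
)

intersect(top_dge, top_dgu)  # in both top lists
setdiff(top_dge, top_dgu)    # only in the DGE list
setdiff(top_dgu, top_dge)    # only in the DGU list
</pre>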
</div>
<div id="step-9-visual-inspection-of-ora-result" class="section level2">
<h2>Step 9: Visual inspection of ORA result</h2>
<p><strong>Question</strong>: Based on your insights from step 8, use the 'plot_ora()' functionality to test whether these are just examples or generalize to all the significant results. An example: if, among the 10-15 top gene-sets, only DGU had gene-sets covering “telomere” function, I would use the 'plot_ora()' function with a telomere-related pattern to test this.</p>
<p><strong>Answer</strong></p>

<pre class="r">
plot_ora(
    ora = cin_ora,
    plotly = FALSE,
    pattern = "CELL_CYCLE", # Identify all gene sets about the cell cycle
    cutoff = 0.1, # Only include significant gene sets
    lines = TRUE, # Guide lines
    colors = c('red','blue','black')
)
</pre>

[[File:Rnaseq_fig1.png]]

<p>Looks like cell cycle changes are mediated by both (enrichment is on the diagonal) and the majority is significant for both DGE and DGU.</p>
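<p>To see which gene sets a keyword would pick up, you can match it against the pathway names yourself. This sketch assumes that the 'pattern' argument is matched against pathway names much like a regular expression, which is an assumption that should be checked in the 'plot_ora()' help page; the pathway names are copied from the step 8 output:</p>
<pre>
pathways <- c(
    "GOBP_MITOTIC_CELL_CYCLE_PROCESS",
    "REACTOME_CELL_CYCLE",
    "KEGG_RIBOSOME",
    "GOBP_RIBOSOME_BIOGENESIS",
    "GOBP_ACTIN_FILAMENT_BASED_PROCESS"
)

# Logical vector: TRUE where the keyword occurs in the name
is_cell_cycle <- grepl("CELL_CYCLE", pathways)

pathways[is_cell_cycle]  # the matching gene sets
sum(is_cell_cycle)       # how many gene sets match
</pre>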

<pre>
plot_ora(
    ora = cin_ora,
    plotly = FALSE,
    pattern = "RIBOSOME", # Identify all gene sets about ribosomes
    cutoff = 0.33, # Only include significant gene sets
    lines = TRUE, # Guide lines
    colors = c('red','blue','black')
)
</pre>

[[File:Rnaseq_fig2.png]]

<p>Ribosome is clearly mainly significant for DGE.</p>
<pre>
plot_ora(
    ora = cin_ora,
    plotly = FALSE,
    pattern = "ACTIN", # Identify all gene sets about actin
    cutoff = 0.33, # Only include significant gene sets
    lines = TRUE, # Guide lines
    colors = c('red','blue','black')
)
</pre>

[[File:Rnaseq_fig3.png]]

<p>Although many actin-related pathways are significant for both DGU and DGE, more are significant for DGU. Also, the enrichment among DGU is more pronounced (points are to the right of the diagonal line).</p>
<p><br></p>

<p>Lastly, note the low correlation suggesting an overall low similarity in biological signaling mediated through DGE and DGU.</p>
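<p>Such a correlation can also be computed directly from the two enrichment-score columns. A sketch on made-up scores (the numbers are hypothetical, and Spearman correlation is chosen here as an assumption; it is not necessarily the measure that 'plot_ora()' reports):</p>
<pre>
toy_scores <- data.frame(
    enrichment_score_deseq  = c(0.62, 0.47, 0.36, 0.31, 0.33),
    enrichment_score_dexseq = c(0.20, 0.55, 0.71, 0.15, 0.60)
)

# Rank-based correlation between DGE and DGU enrichment scores
rho <- cor(
    toy_scores$enrichment_score_deseq,
    toy_scores$enrichment_score_dexseq,
    method = "spearman",
    use = "complete.obs"   # ignore gene sets with a missing score
)
rho
</pre>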
<p><strong>Question</strong>: Try to make a hypothesis as to why this/these molecular functions might be important for cancer.</p>

<p><strong>Answer</strong>:</p>
<ul>
<li>CELL_CYCLE: One of the main hallmarks of cancer - uncontrolled cell division.</li>
<li>RIBOSOME: Many ribosomes are needed when cells are dividing (as indicated by increased cell cycle).</li>
<li>ACTIN: Actin is involved in cell movement and thereby cancer invasion and metastasis.</li>
</ul>
</div>
<div id="step-10-critical-self-evaluation" class="section level2">

<h2>Step 10: Critical self-evaluation</h2>
<p><strong>Question</strong>: Take a moment to think about what potential problems there could be with this assignment. Are there any obvious things we have not taken into consideration?</p>

<p><strong>Answer</strong>: The main problems are:</p>
<ol style="list-style-type: decimal">
<li>More QC should have been done (clustering, outliers, etc)</li>
<li>This is only a subset of the data (the real dataset has ~300 cancer samples)</li>
<li>We do not take covariates (co-factors) into account. How many of the effects are due to e.g. gender and age differences?</li>
</ol>
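<p>As an example of the clustering-based QC mentioned in point 1, samples can be clustered on log-transformed counts with base R. This is a sketch on simulated counts (all names and numbers are hypothetical), not the QC that was applied to the exercise data:</p>
<pre>
set.seed(1)

# Simulated counts: 50 features x 6 samples, the last 3 samples shifted upwards
toy_counts <- matrix(
    rpois(50 * 6, lambda = rep(c(20, 80), each = 50 * 3)),
    nrow = 50,
    dimnames = list(paste0("tx", 1:50), paste0("s", 1:6))
)

log_counts  <- log2(toy_counts + 1)  # dampen the effect of large counts
sample_dist <- dist(t(log_counts))   # distances between samples
sample_tree <- hclust(sample_dist)   # hierarchical clustering

# Cut the tree into two clusters and show the membership of each sample
clusters <- cutree(sample_tree, k = 2)
clusters
</pre>
<p>With real data, a sample that ends up alone in its own cluster, or in the cluster of the other condition, would be a candidate outlier worth inspecting. 'plot(sample_tree)' draws the corresponding dendrogram.</p>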
</div>
<div id="step-11-repport-result" class="section level2">

<h2>Step 11: Report result</h2>
<p>Go to the blackboard and report one or more of the following:</p>
<ul>
<li>A keyword that showed a similar enrichment pattern in DGU and DGE</li>
<li>A keyword that showed preferential regulation through DGU or DGE</li>
</ul>
<hr/>
</div>
</div>
<div id="bonus-assignment" class="section level1">
<h1>Bonus Assignment</h1>
<p>Use 'pairedGSEA' to analyze the other COAD cancer subtype (the one you did not analyze). Are the gene-sets similar or different between the subtypes and analysis types?</p>
</div>
</div>

Rnaseq exercise answers - Revision history
