In this course you will learn applied computational precision
medicine, meaning that you will learn to do analyses in R similar to
those done in clinical settings all over the world. This means that we
will be busy with hands on data analysis and have limited time for long
theoretical presentations. Rather, I will provide you with excellent
online materials (a mix of readings and videos) explaining the methods
we will use that you are required to go through on your own.
Please make sure you go read/watch the materials - I
will not go through them in class. If you are unsure
about aspects of what you read/watch, you are highly
encouraged to seek additional info online. You will then discuss the
materials in groups and I will answer any remaining questions you may
have before we proceed to the fun part: the exercises where you will
learn to apply them. This way we can invest as much of the course as
possible training you to apply these tools to real life data.
Please make sure that you have the latest version of R and Rstudio installed on your laptops,
as well as the following R packages (more will be needed along the
way):
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("fgsea", "GSVA"))
install.packages(c("ggplot2", "reshape2", "e1071", "class"))
When: June 1, 9.00-17.00
Where: Building 210, room 002
What: Lecture and self study
When: June 2, 9.00-17.00
Where: Building 210, room 002
What: Q&A, exercises, and self study
Data
RNA-seq
based expression data for glioblastoma multiforme (TCGA
data)
Exercises
#### Course 22123: Computational precision medicine
#### Day 2 practical: working with gene expression data in R
#### By Lars Ronn Olsen
#### 02/06/2023
#### Technical University of Denmark
## Load necessary packages (you can back fill once you know what packages you will use)
## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - Counts")
## Un-"log2" the data
## Check out the distribution of the expression of a sample or two. What kind of distribution are you looking at?
## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - FPKM") - you should be working with this from now on
## Check out the distribution of the expression of a sample or two. What changed? Why should we transform the raw counts?
## Load phenotype data from Xena ("phenotype" - "phenotype")
## Take a closer look at the column called "sample_type.samples" and read here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
## Take a closer look at the column called "batch_number"
## Extract both those variables from the phenotype table along with the column called "submitter_id.samples".
## Do a little cleanup:
## Rename sample type ids to match those in your expression table columns
## Subset the phenotype table to contain only the samples you have expression data for
## Optional: save Rdata object for later use
## Calculate PCA on count matrix
## Why did this not work?
## Solve the problem and try again
## Check out how much variance is explained
## Is this more or less than you would expect?
## Plot the first two principal components of the PCA
## Now try the same with FPKM transformed data
## Do you get the same results in terms of variance explained and visualization of PC1 and PC2? Why/why not? Which do you think is better to use?
## Plot PCA of FPKM and color by batch
## Do you anticipate batch effects in this dataset?
## Plot PCA and color by sample type
## How does this look?
## Make a heatmap of the genes in the list "diff_genes.txt" (https://teaching.healthtech.dtu.dk/22123/diff_genes.txt)
## Do the clusters in the dendrogram correspond to subtypes or batches?
## Calculate the correlation between genes and visualize a low, medium, and high correlation pair in three separate scatter plots
## Visualize the expression of the gene "ENSG00000267112.1 in primary vs normal samples using boxplots
#### Course 22123: Computational precision medicine
#### Day 2 practical: working with gene expression data in R
#### By Lars Ronn Olsen
#### 02/06/2023
#### Technical University of Denmark
## Load necessary packages (you can back fill once you know what packages you will use)
library(ggplot2)
## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - Counts")
gbm_counts <- read.table("TCGA-GBM.htseq_counts.tsv", header = TRUE, row.names = 1)
## Un-"log2" the data
gbm_counts <- round((2^gbm_counts)-1, digits = 0)
## Check out the distribution of the expression of a sample or two. What kind of distribution are you looking at?
df <- data.frame(sample = gbm_counts[,1])
ggplot(df, aes(x = sample)) +
geom_density()
## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - FPKM") - you should be working with this from now on
gbm_fpkm <- read.table("TCGA-GBM.htseq_fpkm.tsv", header = TRUE, row.names = 1)
## Check out the distribution of the expression of a sample or two. What changed? Why should we transform the raw counts?
df <- data.frame(sample = gbm_fpkm[,1])
ggplot(df, aes(x = sample)) +
geom_density()
## Load phenotype data from Xena ("phenotype" - "phenotype")
gbm_pheno <- read.table("TCGA-GBM.GDC_phenotype.tsv", header = TRUE, sep = "\t", quote="")
## Take a closer look at the column called "sample_type.samples" and read here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
table(gbm_pheno$sample_type.samples)
## Take a closer look at the column called "batch_number" and read here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
table(gbm_pheno$batch_number)
## Extract both those variables from the phenotype table along with the column called "submitter_id.samples".
gbm_pheno <- gbm_pheno[,c("submitter_id.samples", "batch_number", "sample_type.samples")]
## Do a little cleanup:
## Rename sample type ids to match those in your expression table columns
gbm_pheno$submitter_id.samples <- gsub(pattern = "\\-", replacement = ".", x = gbm_pheno$submitter_id.samples)
## Subset the phenotype table to contain only the samples you have expression data for
gbm_pheno <- gbm_pheno[gbm_pheno$submitter_id.samples %in% colnames(gbm_fpkm),]
## Optional: save Rdata object for later use
save(gbm_counts, gbm_fpkm, gbm_pheno, file = "gbm.Rdata")
## Calculate PCA on count matrix
pca_count <- prcomp(t(gbm_counts), scale. = TRUE)
## Why did this not work?
## Solve the problem and try again
gbm_counts <- gbm_counts[rowSums(gbm_counts) > 0,]
pca_count <- prcomp(t(gbm_counts), scale. = TRUE)
## Check out how much variance is explained
summary(pca_count)
## Is this more or less than you would expect?
## Plot the first two principal components of the PCA
df <- data.frame(PC1 = pca_count$x[,1], PC2 = pca_count$x[,2])
ggplot(df, aes(x = PC1, y = PC2)) +
geom_point()
## Now try the same with FPKM transformed data
gbm_fpkm <- gbm_fpkm[rowSums(gbm_fpkm) > 0,]
pca_fpkm <- prcomp(t(gbm_fpkm), scale. = TRUE)
summary(pca_fpkm)
df <- data.frame(PC1 = pca_fpkm$x[,1], PC2 = pca_fpkm$x[,2])
ggplot(df, aes(x = PC1, y = PC2)) +
geom_point()
## Do you get the same results in terms of variance explained and visualization of PC1 and PC2? Why/why not? Which do you think is better to use?
## Plot PCA of FPKM and color by batch
df <- data.frame(PC1 = pca_fpkm$x[,1], PC2 = pca_fpkm$x[,2], batch = gbm_pheno[match(colnames(gbm_fpkm), gbm_pheno$submitter_id.samples),]$batch_number)
ggplot(df, aes(x = PC1, y = PC2, color = batch)) +
geom_point()
## Do you anticipate batch effects in this dataset?
## Plot PCA and color by sample type
df <- data.frame(PC1 = pca_fpkm$x[,1], PC2 = pca_fpkm$x[,2], type = gbm_pheno[match(colnames(gbm_fpkm), gbm_pheno$submitter_id.samples),]$sample_type.samples)
ggplot(df, aes(x = PC1, y = PC2, color = type)) +
geom_point()
## How does this look?
## Make a heatmap of the genes in the list "diff_genes.txt"
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))))
## Do the clusters in the dendrogram correspond to subtypes or batches?
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))))
## To check if clusters match subtypes (you can eyeball an annotation bar or do something more fancy if you want)
ha = HeatmapAnnotation(bar = gbm_pheno[match(gbm_pheno$submitter_id.samples, colnames(gbm_fpkm)),]$sample_type.samples)
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))), top_annotation = ha, row_title = NULL)
## To check if clusters match batches (you can eyeball an annotation bar or do something more fancy if you want)
ha = HeatmapAnnotation(bar = gbm_pheno[match(gbm_pheno$submitter_id.samples, colnames(gbm_fpkm)),]$batch_number)
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))), top_annotation = ha)
## Calculate the correlation between the genes "ENSG00000249568.1" and "ENSG00000249186.1"
df <- data.frame(gene1 = as.matrix(gbm_fpkm)[rownames(gbm_fpkm) == "ENSG00000249568.1",], gene2 = as.matrix(gbm_fpkm)[rownames(gbm_fpkm) == "ENSG00000249186.1",])
ggplot(df, aes(x = gene1, y = gene2)) +
geom_point() +
geom_smooth(method = "lm")
## Visualize the expression of the gene "ENSG00000267112.1" in primary vs normal samples using boxplots
df <- data.frame(gene = as.matrix(gbm_fpkm)[rownames(gbm_fpkm) == "ENSG00000267112.1",], condition = gbm_pheno[match(gbm_pheno$submitter_id.samples, colnames(gbm_fpkm)),]$sample_type.samples)
ggplot(df, aes(x = condition, y = gene, group = condition)) +
geom_boxplot()
Paper
decribing molecular subtyping of cancer
Paper
describing bioinformatics pipelines for molecular subtyping of
cancer
k-nearest
neighbor classification (kNN)
Introduction to gene
set enrichment analysis (GSEA)
When: June 6, 9.00-17.00
Where: Building 210, room 002
What: Q&A, exercises, and self study
Data
Day 3 exercise
data
Exercise
#### Course 22123: Computational precision medicine
#### Day 3 practical: clustering and feature selection for molecular subtyping
#### By Lars Ronn Olsen
#### 06/06/2023
#### Technical University of Denmark
## Load necessary packages
## Load exercise data
# CIT_full contains microarray expression of all genes for all samples
# CIT_subtyping contains microarray expression of 245 subtyping genes for all samples
# CIT_class contains the subtypes for all 355 samples
# CIT_genesets contains genes upregulated in each subtype
# Bordet_RNA_tpm contains RNA seq expression of all genes for a cohort
## Check out the distribution of the expression of a sample or two (or all)
## Does it look different from RNA-seq? How? Why? Do you anticipate any problems comparing these data with RNA-seq data?
## Do a leave-one-out of a distance to centroid, a kNN, OR a ssGSEA classification of CIT (or all three if you are feeling confident)
#### Course 22123: Computational precision medicine
#### Day 3 practical: clustering and feature selection for molecular subtyping
#### By Lars Ronn Olsen
#### 06/06/2023
#### Technical University of Denmark
## Load necessary packages
library(ggplot2)
library(e1071)
library(class)
library(GSVA)
## Load expression data (NOTE: this is microarray data, not RNA-seq!)
load("Day3_data.Rdata")
# CIT_full contains the expression of all probes for all samples
# CIT_subtyping contains the expression of the 375 probes for all samples
# CIT_annot contains the annotation for the 375 probes used for classification
# CIT_class contains the subtypes for all 355 samples
## Check out the distribution of the expression of a sample or two (or all)
df <- data.frame(microarray = CIT_full[,1])
ggplot(df, aes(x = microarray)) +
geom_density()
# Does it look different from RNA-seq? How? Why? Do you anticipate any problems comparing these data with RNA-seq data?
# We definitely cannot compare the two!
## Do a leave-one-out classification of CIT subtyping cohort using distance to centroid (DTC)
# make empty prediction vector
pred_vector <- c()
# loop over samples (columns)
for(i in 1:ncol(CIT_subtyping)) {
# make a training matrix consisting of all samples but the one you leave out
training <- CIT_subtyping[,-i]
# remove that samples class from the class vector
training_classes <- CIT_classes[-i]
# make a test vector consisting of only that sample
test <- CIT_subtyping[,i]
# get the class of that sample
test_class <- CIT_classes[i]
# make an empty centroid matrix
centroids <- NULL
# loop over each of the six classes
for (class in unique(CIT_classes)) {
# for each of these classes, subset the training matrix to samples belonging to that class, and calculate the mean expression of each probe in the class
class_centroid <- rowMeans(training[,training_classes==class])
# add the mean vector to the centroids matrix
centroids <- cbind(centroids, class_centroid)
}
# add colnames to the centroid matrix
colnames(centroids) <- unique(CIT_classes)
# calculate the distance of the test sample to the centroids
d <- as.matrix(dist(t(cbind(centroids, test))))
# assign the class of the closest centroid
class_pred <- names(which.min(d[1:6,7]))
# check if you got it right and make a logical vector
pred_vector <- c(pred_vector, test_class==class_pred)
}
# check how many of the cases you got it right
table(pred_vector)
# Does DTC classification work?
## Do a leave-one-out classification of CIT subtyping cohort using kNN
# decide on a k (you could try 5 for a start)
k <- 5
# make empty prediction vector
knn_pred <- c()
# loop over all samples
for(i in 1:ncol(CIT_subtyping)) {
knn_pred <- c(knn_pred, as.vector(knn(train = t(CIT_subtyping[,-i]), test = CIT_subtyping[,i], cl = as.factor(CIT_classes[-i]), k = k)))
}
# check how many of the cases you got it right
table(knn_pred==CIT_classes)
# Does kNN classification work? Maybe try another k?
## Do a classification of CIT training samples using ssGSEA (NOTE: you should use the CIT_full for this)
ssGSEA_enrichments <- gsva(expr = CIT_full, gset.idx.list = CIT_genesets, method = "ssgsea")
ssGSEA_pred <- apply(ssGSEA_enrichments, 2, which.max)
ssGSEA_pred <- rownames(ssGSEA_enrichments)[ssGSEA_pred]
table(ssGSEA_pred==CIT_classes)
Brief
introduction to immune checkpoint inhibitor therapy
Resistance
to immune checkpoint inhibitor therapy (read introduction and review
Figure 1 + Table 1)
Paper describing
predictions of response to immune checkpoint inhibitor therapy using a T
cell inflammation signature (read introduction + section “2.3.1
T-cell–inflamed GEP”)
Tumor
microenvironment and tumor heterogeneity (read introduction, review the
video and Figure 1)
Your turn! Find resources for these
concepts:
Tumor mutational burden (TMB) + how it is measured
Prediction of tumor purity using the algorithm ESTIMATE
Besides TMB
and the “T cell inflamed gene expression profile” there is one more
(clinically approved) prognostic marker for response to checkpoint
inhibitor therapy. Find out what it is.
When: June 7, 9.00-17.00
Where: Building 210, room 002
What: Q&A, exercises, and self study
Data
Day 4 exercise
data
Exercise
## Load necessary packages
## Load data
# expr contains normalized tpm values for clear cell renal cell carcinoma patients treated with PD1 inhibitor
# pheno contains treatment and response data of which you primarily need to consider the columns:
# RNA_ID: sample names. Matches the column names in expr
# ORR: Objective response rate. CRPR: complete response/partial response, PR: partial response, SD: incomplete response/stable disease, PD: progressive disease, NE: inevaluable
# PFS: progression free survival aka how long did patients survive after treatment without progression in disease
## In these samples, collect or quantify the following:
## Tumor mutational burden
## PD-1 expression (note that the gene is called "PDCD1")
## T cell inflamed gene expression profile (in the paper, they do "weighted sum average of normalized expression"), but you can use ssGSEA on the 18 genes
## Does this correlate with ORR classes? What about PFS?
## Estimate tumor purity in each sample and examine the range. How would you anticipate this to affect the three metrics, tumor mutational burden, PD-1 expression, and T cell inflamed gene expression profile?
## Use this function. The package is crap.
library(estimate)
ESTIMATE_fun <- function(expr) {
write.table(as.data.frame(expr), file = "rma.data.gct", quote = F, sep = '\t')
filterCommonGenes(input.f = 'rma.data.gct', output.f = "RMA_10412.gct", id = "GeneSymbol")
estimateScore("RMA_10412.gct", "estimate_score.gct", platform = "affymetrix")
estimate <- read.table("estimate_score.gct", sep = '\t', row.names = 1, header = T, skip = 2); estimate <- estimate[,-1]
# Clean up
file.remove("rma.data.gct");
file.remove("RMA_10412.gct");
file.remove("estimate_score.gct")
# Process scores for easy output
estimate_scores <- as.data.frame(t(estimate))
colnames(estimate_scores) <- c('StromalSignature', 'ImmuneSignature', 'ESTIMATEScore', 'TumorPurity')
return(estimate_scores)
}
## Load necessary packages
library(GSVA)
library(estimate)
library(ggplot2)
## Load data
load("/Users/lro/Library/CloudStorage/GoogleDrive-larsronnolsen@gmail.com/My Drive/Teaching/courses/Computational precision medine/Exercises/day 4/Day4_data.Rdata")
# expr contains normalized tpm values for clear cell renal cell carcinoma patients treated with PD1 inhibitor
# pheno contains treatment and response data of which you primarily need to consider the columns:
# RNA_ID: sample names. Matches the column names in expr
# ORR: Objective response rate. CRPR: complete response/partial response, PR: partial response, SD: incomplete response/stable disease, PD: progressive disease, NE: inevaluable
## First, get pheno and expression profiles organized to match
expr <- expr[, match(pheno$RNA_ID, colnames(expr))]
## In these samples, collect or quantify the following:
## Tumor mutational burden
TMB <- pheno$TMB_Counts
## PD-1 expression
PD1 <- as.matrix(expr)[rownames(expr) == "PDCD1",]
## T cell inflamed gene expression profile (in the paper, they do "weighted sum average of normalized expression"), but you can use ssGSEA on the 18 genes
TcellGEP_sig <- list(TcellGEP= c("CCL5", "CD27", "CD274", "CD276", "CD8A", "CMKLR1", "CXCL9", "CXCR6", "HLA-DQA1", "HLA-DRB1", "HLA-E", "IDO1", "LAG3", "NKG7", "PDCD1LG2", "PSMB10", "STAT1", "TIGIT"))
TcellGEP <- gsva(expr = as.matrix(expr), gset.idx.list = TcellGEP_sig, method = "ssgsea")
## Does this correlate with ORR?
df <- data.frame(TMB = TMB, PD1 = PD1, TcellGEP = as.vector(TcellGEP), ORR = pheno$ORR, PFS = pheno$PFS)
ggplot(df, aes(x = ORR, y = TMB, group = ORR)) +
geom_boxplot() +
ggtitle("TMB")
ggplot(df, aes(x = ORR, y = PD1, group = ORR)) +
geom_boxplot() +
ggtitle("PD1")
ggplot(df, aes(x = ORR, y = TcellGEP, group = ORR)) +
geom_boxplot() +
ggtitle("TcellGEP")
## What about progression free survival (PFS)?
ggplot(df, aes(x = PFS, y = TMB)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("TMB")
ggplot(df, aes(x = PFS, y = PD1)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("PD1")
ggplot(df, aes(x = PFS, y = TcellGEP)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("TcellGEP")
## Estimate tumor purity in each sample and examine the range. How would you anticipate this to affect the three metrics, tumor mutational burden, PD-1 expression, and T cell inflamed gene expression profile?
# Function for ESTIMATE
ESTIMATE_fun <- function(expr) {
write.table(as.data.frame(expr), file = "rma.data.gct", quote = F, sep = '\t')
filterCommonGenes(input.f = 'rma.data.gct', output.f = "RMA_10412.gct", id = "GeneSymbol")
estimateScore("RMA_10412.gct", "estimate_score.gct", platform = "affymetrix")
estimate <- read.table("estimate_score.gct", sep = '\t', row.names = 1, header = T, skip = 2); estimate <- estimate[,-1]
# Clean up
file.remove("rma.data.gct");
file.remove("RMA_10412.gct");
file.remove("estimate_score.gct")
# Process scores for easy output
estimate_scores <- as.data.frame(t(estimate))
colnames(estimate_scores) <- c('StromalSignature', 'ImmuneSignature', 'ESTIMATEScore', 'TumorPurity')
return(estimate_scores)
}
estimate_scores <- ESTIMATE_fun(expr)
df$purity <- estimate_scores$TumorPurity
ggplot(df, aes(x = purity, y = TMB)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("TMB")
ggplot(df, aes(x = purity, y = PD1)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("PD1")
ggplot(df, aes(x = purity, y = TcellGEP)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("TcellGEP")
When: June 8, 9.00-17.00
Where: Building 210, room 002
What: Follow-up on exercises, Q&A, Lecture, exercises, and self study
In this exercise, you will use the online bioinformatics tools to evaluate whether the protein mesothelin (MSLN) is a suitable target for CAR cell therapy of hepatocellular carcinoma You will need to manually keep track of everything you find using these tools to give you final recommendation regarding its suitability.
Task 1: Extract the sequence of the canonical protein from UniProtWhen: June 9
Where: Building 210, room 002
What: Written exam, lecture
When: June 12 - June 21
Where: Where ever you want!
What: Project work
When: June 22
Where: TBA
What: Exam (details TBA)