The purpose and format of this course

In this course you will learn applied computational precision medicine, meaning that you will learn to do analyses in R similar to those done in clinical settings all over the world. This means that we will be busy with hands on data analysis and have limited time for long theoretical presentations. Rather, I will provide you with excellent online materials (a mix of readings and videos) explaining the methods we will use that you are required to go through on your own. Please make sure you go read/watch the materials - I will not go through them in class. If you are unsure about aspects of what you read/watch, you are highly encouraged to seek additional info online. You will then discuss the materials in groups and I will answer any remaining questions you may have before we proceed to the fun part: the exercises where you will learn to apply them. This way we can invest as much of the course as possible training you to apply these tools to real life data.

Preparing for the course

Please make sure that you have the latest version of R and Rstudio installed on your laptops, as well as the following R packages (more will be needed along the way):

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("fgsea", "GSVA"))
install.packages(c("ggplot2", "reshape2", "e1071", "class"))


Day 1: Introduction to computational precision medicine

When: June 1, 9.00-17.00

Where: Building 210, room 002

What: Lecture and self study

Program (click to expand) 9.00 - 10.00: Lecture: Introduction to computational precision medicine
Rest of the day: Go through the assigned curriculum for day 2 (self study)

Today’s learning objectives (click to expand) After today (and the recap tomorrow morning) you will be able to:
* Describe the purpose of bioinformatics in precision medicine, with examples from diagnostics, prognostics, and therapeutics

Curriculum for Day 2 (click to expand) Short history of gene expression profiling
Next generation sequencing of RNA (RNA-seq)
Introduction to RNA-seq data
The Cancer Genome Atlas Legacy: Pushing the Boundaries of Research
Introduction to the UCSC Xena data collection
Introduction to the molecular pathogenesis of cancer (also known as “the hallmarks of cancer”)
Your turn! Find resources for these concepts and how to do them in R:
* Batch effects/technical variance in gene expression data
* Principal component analysis
* Pearson’s correlation coefficient, Spearman’s correlation coefficient, and R^2 (also look up “Simpson’s paradox”)
* Heatmaps, hierarchical clustering, and dendrograms
* Boxplots


Day 2: Brush up on theory + working with expression data in R

When: June 2, 9.00-17.00

Where: Building 210, room 002

What: Q&A, exercises, and self study

Program (click to expand) 9.00 - 10.00: Q&A: introduction to precision medicine + “theoretical brush up”
10.00 - 12.00: Exercise: working with gene expression data in R
Rest of the day: Go through the assigned materials for day 3 (self study)

Today’s learning objectives (click to expand) After today you will be able to:
* Understand how microarrays work
* Understand how RNA sequencing (RNA-Seq) works
* Understand and apply the basic concepts of principal component analysis (PCA)
* Understand and apply the basic concept of correlation between vectors using 1) Pearson’s correlation coefficient, 2) Spearman’s correlation coefficient, and 3) R^2
* Understand the basic concept of Simpson’s paradox and why it matters
* Understand the difference between the data coming out of RNA seq experiments and microarrays
* Understand the basic concept of gene expression matrices
* Understand the basic concept of expression distributions
* Understand the basic concepts of batch effects/technical variance in expression data
Note: when I write “understand” I mean the basic intuition of concepts. You will not be asked about the underlying biology or math unless specified.
Note: whenever I write “apply” it is implied that this is in R (unless otherwise specified)

Exercise (click to expand)

Data
RNA-seq based expression data for glioblastoma multiforme (TCGA data)

Exercises

#### Course 22123: Computational precision medicine
#### Day 2 practical: working with gene expression data in R
#### By Lars Ronn Olsen
#### 02/06/2023
#### Technical University of Denmark


## Load necessary packages (you can back fill once you know what packages you will use)


## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - Counts")


## Un-"log2" the data


## Check out the distribution of the expression of a sample or two. What kind of distribution are you looking at?


## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - FPKM") - you should be working with this from now on


## Check out the distribution of the expression of a sample or two. What changed? Why should we transform the raw counts?


## Load phenotype data from Xena ("phenotype" - "phenotype")


## Take a closer look at the column called "sample_type.samples" and read here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes


## Take a closer look at the column called "batch_number"


## Extract both those variables from the phenotype table along with the column called "submitter_id.samples".


## Do a little cleanup:
## Rename sample type ids to match those in your expression table columns


## Subset the phenotype table to contain only the samples you have expression data for


## Optional: save Rdata object for later use


## Calculate PCA on count matrix
## Why did this not work?


## Solve the problem and try again


## Check out how much variance is explained
## Is this more or less than you would expect?


## Plot the first two principal components of the PCA


## Now try the same with FPKM transformed data
## Do you get the same results in terms of variance explained and visualization of PC1 and PC2? Why/why not? Which do you think is better to use?


## Plot PCA of FPKM and color by batch
## Do you anticipate batch effects in this dataset?


## Plot PCA and color by sample type
## How does this look?


## Make a heatmap of the genes in the list "diff_genes.txt" (https://teaching.healthtech.dtu.dk/22123/diff_genes.txt)


## Do the clusters in the dendrogram correspond to subtypes or batches?


## Calculate the correlation between genes and visualize a low, medium, and high correlation pair in three separate scatter plots


## Visualize the expression of the gene "ENSG00000267112.1 in primary vs normal samples using boxplots


Exercise answers (click to expand)
#### Course 22123: Computational precision medicine
#### Day 2 practical: working with gene expression data in R
#### By Lars Ronn Olsen
#### 02/06/2023
#### Technical University of Denmark

## Load necessary packages (you can back fill once you know what packages you will use)
library(ggplot2)


## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - Counts")
gbm_counts <- read.table("TCGA-GBM.htseq_counts.tsv", header = TRUE, row.names = 1)


## Un-"log2" the data
gbm_counts <- round((2^gbm_counts)-1, digits = 0)


## Check out the distribution of the expression of a sample or two. What kind of distribution are you looking at?
df <- data.frame(sample = gbm_counts[,1])
ggplot(df, aes(x = sample)) +
  geom_density()


## Load expression data from Xena ("gene expression RNAseq" - "HTSeq - FPKM") - you should be working with this from now on
gbm_fpkm <- read.table("TCGA-GBM.htseq_fpkm.tsv", header = TRUE, row.names = 1)


## Check out the distribution of the expression of a sample or two. What changed? Why should we transform the raw counts?
df <- data.frame(sample = gbm_fpkm[,1])
ggplot(df, aes(x = sample)) +
  geom_density()


## Load phenotype data from Xena ("phenotype" - "phenotype")
gbm_pheno <- read.table("TCGA-GBM.GDC_phenotype.tsv", header = TRUE, sep = "\t", quote="")


## Take a closer look at the column called "sample_type.samples" and read here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
table(gbm_pheno$sample_type.samples)


## Take a closer look at the column called "batch_number" and read here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
table(gbm_pheno$batch_number)


## Extract both those variables from the phenotype table along with the column called "submitter_id.samples".
gbm_pheno <- gbm_pheno[,c("submitter_id.samples", "batch_number", "sample_type.samples")]


## Do a little cleanup:
## Rename sample type ids to match those in your expression table columns
gbm_pheno$submitter_id.samples <- gsub(pattern = "\\-", replacement = ".", x = gbm_pheno$submitter_id.samples)


## Subset the phenotype table to contain only the samples you have expression data for
gbm_pheno <- gbm_pheno[gbm_pheno$submitter_id.samples %in% colnames(gbm_fpkm),]


## Optional: save Rdata object for later use
save(gbm_counts, gbm_fpkm, gbm_pheno, file = "gbm.Rdata")


## Calculate PCA on count matrix
pca_count <- prcomp(t(gbm_counts), scale. = TRUE)
## Why did this not work?


## Solve the problem and try again
gbm_counts <- gbm_counts[rowSums(gbm_counts) > 0,]
pca_count <- prcomp(t(gbm_counts), scale. = TRUE)


## Check out how much variance is explained
summary(pca_count)
## Is this more or less than you would expect?


## Plot the first two principal components of the PCA
df <- data.frame(PC1 = pca_count$x[,1], PC2 = pca_count$x[,2])
ggplot(df, aes(x = PC1, y = PC2)) +
  geom_point()


## Now try the same with FPKM transformed data
gbm_fpkm <- gbm_fpkm[rowSums(gbm_fpkm) > 0,]
pca_fpkm <- prcomp(t(gbm_fpkm), scale. = TRUE)
summary(pca_fpkm)
df <- data.frame(PC1 = pca_fpkm$x[,1], PC2 = pca_fpkm$x[,2])
ggplot(df, aes(x = PC1, y = PC2)) +
  geom_point()
## Do you get the same results in terms of variance explained and visualization of PC1 and PC2? Why/why not? Which do you think is better to use?


## Plot PCA of FPKM and color by batch
df <- data.frame(PC1 = pca_fpkm$x[,1], PC2 = pca_fpkm$x[,2], batch = gbm_pheno[match(colnames(gbm_fpkm), gbm_pheno$submitter_id.samples),]$batch_number)
ggplot(df, aes(x = PC1, y = PC2, color = batch)) +
  geom_point()
## Do you anticipate batch effects in this dataset?


## Plot PCA and color by sample type
df <- data.frame(PC1 = pca_fpkm$x[,1], PC2 = pca_fpkm$x[,2], type = gbm_pheno[match(colnames(gbm_fpkm), gbm_pheno$submitter_id.samples),]$sample_type.samples)
ggplot(df, aes(x = PC1, y = PC2, color = type)) +
  geom_point()
## How does this look?


## Make a heatmap of the genes in the list "diff_genes.txt"
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))))


## Do the clusters in the dendrogram correspond to subtypes or batches?
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))))

## To check if clusters match subtypes (you can eyeball an annotation bar or do something more fancy if you want)
ha = HeatmapAnnotation(bar = gbm_pheno[match(gbm_pheno$submitter_id.samples, colnames(gbm_fpkm)),]$sample_type.samples)
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))), top_annotation = ha, row_title = NULL)

## To check if clusters match batches (you can eyeball an annotation bar or do something more fancy if you want)
ha = HeatmapAnnotation(bar = gbm_pheno[match(gbm_pheno$submitter_id.samples, colnames(gbm_fpkm)),]$batch_number)
Heatmap(t(scale(t(gbm_fpkm[rownames(gbm_fpkm) %in% gene_list,]))), top_annotation = ha)

## Calculate the correlation between the genes "ENSG00000249568.1" and "ENSG00000249186.1"
df <- data.frame(gene1 = as.matrix(gbm_fpkm)[rownames(gbm_fpkm) == "ENSG00000249568.1",], gene2 = as.matrix(gbm_fpkm)[rownames(gbm_fpkm) == "ENSG00000249186.1",])
ggplot(df, aes(x = gene1, y = gene2)) +
  geom_point() +
  geom_smooth(method = "lm")

## Visualize the expression of the gene "ENSG00000267112.1" in primary vs normal samples using boxplots
df <- data.frame(gene = as.matrix(gbm_fpkm)[rownames(gbm_fpkm) == "ENSG00000267112.1",], condition = gbm_pheno[match(gbm_pheno$submitter_id.samples, colnames(gbm_fpkm)),]$sample_type.samples)
ggplot(df, aes(x = condition, y = gene, group = condition)) +
  geom_boxplot()


Self-evaluation quiz (click to expand) Multiple choice quiz for todays curriculum. NOTE This is a subset of what you need to know about these topics. It is not a comprehensive list of what you need to know!

1. What can PCA be used for?
a) Visualization of multidimensional sample differences
b) Dimensionality reduction
c) Visualization of batch effects
d) All of the above

2. What are the units of the axes in a PCA plot?
a) Gene expression
b) Variance in the data
c) Eigenvalues
d) Raw component scores

3. What are the units of gene quantification in RNA sequencing?
a) Transcripts per million
b) Fragments per kilobase per million
c) Number of reads mapped to a given transcript
d) All of the above

4. What is the distribution of RNA sequencing counts?
a) Normal distribution
b) Beta distribution
c) Negative binomial distribution
d) T distribution

5. What is the purpose of computational precision medicine?
a) Finding and evaluating new therapeutic targets
b) Supporting the practices of precision medicine with data analysis
c) Supporting of precision diagnostics by large-scale data analysis
d) All of the above

6. What is the difference between Pearson’s and Spearman’s correlation coefficients?
a) Pearson’s correlation coefficient is used to calculate correlations in gene expression values, whereas Pearson’s correlation coefficient is used to calculate correlations in mutational loads
b) Pearson’s correlation evaluates the linear relationship between two continuous variables, whereas Spearman’s correlation evaluates the monotonic relationship
c) Spearmans’s correlation evaluates the linear relationship between two continuous variables, whereas Pearson’s correlation evaluates the monotonic relationship
d) None of the above

7. What are batch effects?
a) Biological variance that we want to remove
b) Biological variance that we want to keep
c) Technical variance that we want to remove
d) Technical variance that we want to keep

8. How many dimensions are produced using PCA?
a) 2
b) 3
c) 4
d) same as the number of features in the orignal space

9. What is the purpose of the UCSC Xena project?
a) Enabling visual data analysis
b) A standardized pre-processing of mulitple omics data sets
c) A repository of gene expression data
d) All of the above

10. Which of the following are NOT a hallmark of cancer
a) Replicative immortality
b) Resistance to immunotherapy
c) Resistance to programmed cell death
d) Genomic instability


Self-evaluation quiz answers (click to expand) d, d, d, c, d, b, c, d, d, b


Curriculum for Day 3 (click to expand)

Paper decribing molecular subtyping of cancer
Paper describing bioinformatics pipelines for molecular subtyping of cancer
k-nearest neighbor classification (kNN)
Introduction to gene set enrichment analysis (GSEA)


Day 3: Computational precision diagnostics: molecular subtyping of cancer

When: June 6, 9.00-17.00

Where: Building 210, room 002

What: Q&A, exercises, and self study

Program (click to expand) 9.00 - 10.00: Follow-up on day 2 exercises + Q&A: Molecular subtyping of cancer
10.00 - 12.00: Exercises: classifying patients samples to existing subtypes
Rest of the day: Go through the assigned materials for day 4 (self study)

Today’s learning objectives After today you will be able to:
* Understand the basic concept of molecular subtyping of cancer
* Describe the concept of distance between vectors
* Apply Euclidean distance to an expression matrix
* Apply Kendall Tau distance between ranked vectors
* Understand and apply distance to centroid classification on microarray data
* Understand and apply the k-nearest neighbor classification algorithm on microarray data
* Describe the different sources of variance (biological and technical) between samples
      * Technical:
            * Gene quantification technology
            * Tissue handling protocol
            * Who handled it
            * When they handled it
            * Where they handled it
            * Preprocessing (reference, tools used, etc)
      * Biological:
            * Sample purity
            * Immune infiltrate
            * Age/sex/ethnicity
            * Prior treatments
            * Preconditions (prior diseases)
            * Time from diagnosis
            * Sample swaps
* Understand how to deal with biological and technical variance between samples and apply solutions in R
* Understand the concept of gene set enrichment analysis (rank-based)
* Understand and apply single sample gene set enrichment analysis (ssGSEA)

Exercises

Data
Day 3 exercise data

Exercise

#### Course 22123: Computational precision medicine
#### Day 3 practical: clustering and feature selection for molecular subtyping
#### By Lars Ronn Olsen
#### 06/06/2023
#### Technical University of Denmark

## Load necessary packages


## Load exercise data

# CIT_full contains microarray expression of all genes for all samples
# CIT_subtyping contains microarray expression of 245 subtyping genes for all samples
# CIT_class contains the subtypes for all 355 samples
# CIT_genesets contains genes upregulated in each subtype
# Bordet_RNA_tpm contains RNA seq expression of all genes for a cohort 


## Check out the distribution of the expression of a sample or two (or all)


## Does it look different from RNA-seq? How? Why? Do you anticipate any problems comparing these data with RNA-seq data?


## Do a leave-one-out of a distance to centroid, a kNN, OR a ssGSEA classification of CIT (or all three if you are feeling confident)
Exercise answers (click to expand)
#### Course 22123: Computational precision medicine
#### Day 3 practical: clustering and feature selection for molecular subtyping
#### By Lars Ronn Olsen
#### 06/06/2023
#### Technical University of Denmark

## Load necessary packages
library(ggplot2)
library(e1071)
library(class)
library(GSVA)

## Load expression data (NOTE: this is microarray data, not RNA-seq!)
load("Day3_data.Rdata")
# CIT_full contains the expression of all probes for all samples
# CIT_subtyping contains the expression of the 375 probes for all samples
# CIT_annot contains the annotation for the 375 probes used for classification
# CIT_class contains the subtypes for all 355 samples


## Check out the distribution of the expression of a sample or two (or all)
df <- data.frame(microarray = CIT_full[,1])
ggplot(df, aes(x = microarray)) +
  geom_density()


# Does it look different from RNA-seq? How? Why? Do you anticipate any problems comparing these data with RNA-seq data?
# We definitely cannot compare the two!


## Do a leave-one-out classification of CIT subtyping cohort using distance to centroid (DTC)
# make empty prediction vector
pred_vector <- c()
# loop over samples (columns)
for(i in 1:ncol(CIT_subtyping)) {
  # make a training matrix consisting of all samples but the one you leave out
  training <- CIT_subtyping[,-i]
  # remove that samples class from the class vector
  training_classes <- CIT_classes[-i]
  # make a test vector consisting of only that sample
  test <- CIT_subtyping[,i]
  # get the class of that sample
  test_class <- CIT_classes[i]
  
  # make an empty centroid matrix
  centroids <- NULL
  # loop over each of the six classes
  for (class in unique(CIT_classes)) {
    # for each of these classes, subset the training matrix to samples belonging to that class, and calculate the mean expression of each probe in the class
    class_centroid <- rowMeans(training[,training_classes==class])
    # add the mean vector to the centroids matrix
    centroids <- cbind(centroids, class_centroid)
  }
  # add colnames to the centroid matrix
  colnames(centroids) <- unique(CIT_classes)
  # calculate the distance of the test sample to the centroids
  d <- as.matrix(dist(t(cbind(centroids, test))))
  # assign the class of the closest centroid
  class_pred <- names(which.min(d[1:6,7]))
  # check if you got it right and make a logical vector
  pred_vector <- c(pred_vector, test_class==class_pred)
}
# check how many of the cases you got it right
table(pred_vector)


# Does DTC classification work?


## Do a leave-one-out classification of CIT subtyping cohort using kNN
# decide on a k (you could try 5 for a start)
k <- 5
# make empty prediction vector
knn_pred <- c()
# loop over all samples
for(i in 1:ncol(CIT_subtyping)) {
  knn_pred <- c(knn_pred, as.vector(knn(train = t(CIT_subtyping[,-i]), test = CIT_subtyping[,i], cl = as.factor(CIT_classes[-i]), k = k)))
}


# check how many of the cases you got it right
table(knn_pred==CIT_classes)


# Does kNN classification work? Maybe try another k?


## Do a classification of CIT training samples using ssGSEA (NOTE: you should use the CIT_full for this)
ssGSEA_enrichments <- gsva(expr = CIT_full, gset.idx.list = CIT_genesets, method = "ssgsea")
ssGSEA_pred <- apply(ssGSEA_enrichments, 2, which.max)
ssGSEA_pred <- rownames(ssGSEA_enrichments)[ssGSEA_pred]
table(ssGSEA_pred==CIT_classes)
Self-evaluation quiz (click to expand) Multiple choice quiz for todays curriculum. NOTE This is a subset of what you need to know about these topics. It is not a comprehensive list of what you need to know!

1. What is the purpose of molecular subtyping cancer samples?
a) To identify specific genetic mutations within cancer cells
b) To determine the appropriate treatment options for individual patients
c) To classify cancer samples into distinct subgroups based on molecular characteristics
d) All of the above

2. In K-nearest neighbors (KNN) classification, what does the value of K represent?
a) The number of features used for classification
b) The distance metric used to calculate similarity between data points
c) The number of neighbors considered for classification
d) The confidence level of the classification result

3. In distance to centroid classification, how is the class label of a new data point determined?
a) By comparing the distances of the data point to all other data points
b) By calculating the average distance of the data point to all other data points
c) By assigning the class label of the centroid closest to the data point
d) By considering the distances of the data point to the centroids of all classes

4. Which of the following is a commonly used approach for molecular subtyping of cancer?
a) Immunohistochemistry (IHC)
b) Next-generation sequencing (NGS)
c) Gene expression profiling
d) All of the above

5. What is the primary objective of single sample gene set enrichment analysis (ssGSEA)?
a) To identify differentially expressed genes in multiple samples
b) To determine the functional enrichment of gene sets in a single sample
c) To quantify the expression levels of individual genes in a given sample
d) To measure the degree of correlation between gene expression patterns in a single sample and predefined gene sets associated with specific biological functions or pathways

6. What are batch effects in gene expression data?
a) Random fluctuations in gene expression levels within a sample
b) Differences in gene expression between different biological conditions
c) Systematic variations in gene expression attributed to technical factors
d) Genetic mutations affecting gene expression patterns

7. When comparing gene expression data of cohorts of patients, why is it important to consider biological confounders?
a) Biological confounders can introduce noise in the gene expression data
b) Biological confounders can lead to biased interpretation of gene expression differences
c) Biological confounders can affect the reliability of gene expression measurements
d) All of the above

8. How is the distance between two ranked vectors typically measured?
a) Hamming distance
b) Manhattan distance
c) Kendall’s tau distance
d) Mahalanobis distance

9. Why am I teaching you to work with microarray data when RNA seq provides a more comprehensive gene expression profiling?
a) It is better than RNA sequencing for cancer samples
b) A lot of cancer subtyping schemes are defined on microarray data
c) It is still cheaper than RNA sequencing
d) All of the above

Self-evaluation quiz answers (click to expand) c, c, c, d, b, c, d, c, b

Curriculum for Day 4

Brief introduction to immune checkpoint inhibitor therapy
Resistance to immune checkpoint inhibitor therapy (read introduction and review Figure 1 + Table 1)
Paper describing predictions of response to immune checkpoint inhibitor therapy using a T cell inflammation signature (read introduction + section “2.3.1 T-cell–inflamed GEP”)
Tumor microenvironment and tumor heterogeneity (read introduction, review the video and Figure 1)
Your turn! Find resources for these concepts:
Tumor mutational burden (TMB) + how it is measured
Prediction of tumor purity using the algorithm ESTIMATE
Besides TMB and the “T cell inflamed gene expression profile” there is one more (clinically approved) prognostic marker for response to checkpoint inhibitor therapy. Find out what it is.

Day 4: Predicting prognostics with treatment: checkpoint inhibitor therapy response

When: June 7, 9.00-17.00

Where: Building 210, room 002

What: Q&A, exercises, and self study

Program 9.00 - 10.00: Follow-up on day 3 exercises + Q&A: checkpoint inhibitor therapy resistance
10.00 - 12.00: Exercise: Prediction of response non-response to checkpoint inhibitor therapy
Rest of the day: Go through the assigned materials for Day 5 (self study)

Today’s learning objectives After today you will be able to:
* Understand and apply the following concepts for prediction of response to checkpoint inhibitor therapy:
      * Tumor mutational burden
      * T cell Inflamed gene expression profile
      * PD-L1 expression
* Understand the concepts of checkpoint inhibitor resistance, and why the three features mentioned above is not enough to fully predict response
* Understand and apply the tool ESTIMATE for estimating tumor purity

Exercises

Data
Day 4 exercise data

Exercise

## Load necessary packages


## Load data
# expr contains normalized tpm values for clear cell renal cell carcinoma patients treated with PD1 inhibitor 
# pheno contains treatment and response data of which you primarily need to consider the columns:
# RNA_ID: sample names. Matches the column names in expr
# ORR: Objective response rate. CRPR: complete response/partial response, PR: partial response, SD: incomplete response/stable disease, PD: progressive disease, NE: inevaluable
# PFS: progression free survival aka how long did patients survive after treatment without progression in disease


## In these samples, collect or quantify the following:


## Tumor mutational burden


## PD-1 expression (note that the gene is called "PDCD1")


## T cell inflamed gene expression profile (in the paper, they do "weighted sum average of normalized expression"), but you can use ssGSEA on the 18 genes


## Does this correlate with ORR classes? What about PFS?


## Estimate tumor purity in each sample and examine the range. How would you anticipate this to affect the three metrics, tumor mutational burden, PD-1 expression, and T cell inflamed gene expression profile?

## Use this function. The package is crap.
library(estimate)
ESTIMATE_fun <- function(expr) {
  write.table(as.data.frame(expr), file = "rma.data.gct", quote = F, sep = '\t')
  
  filterCommonGenes(input.f = 'rma.data.gct', output.f = "RMA_10412.gct", id = "GeneSymbol")
  estimateScore("RMA_10412.gct", "estimate_score.gct", platform = "affymetrix")
  estimate <- read.table("estimate_score.gct", sep = '\t', row.names = 1, header = T, skip = 2); estimate <- estimate[,-1]
  
  # Clean up
  file.remove("rma.data.gct"); 
  file.remove("RMA_10412.gct"); 
  file.remove("estimate_score.gct")
  
  # Process scores for easy output
  estimate_scores <- as.data.frame(t(estimate))
  colnames(estimate_scores) <- c('StromalSignature', 'ImmuneSignature', 'ESTIMATEScore', 'TumorPurity')
  
  return(estimate_scores)
}
Exercise answers (click to expand)
## Load necessary packages
library(GSVA)
library(estimate)
library(ggplot2)

## Load data
load("/Users/lro/Library/CloudStorage/GoogleDrive-larsronnolsen@gmail.com/My Drive/Teaching/courses/Computational precision medine/Exercises/day 4/Day4_data.Rdata")
# expr contains normalized tpm values for clear cell renal cell carcinoma patients treated with PD1 inhibitor 
# pheno contains treatment and response data of which you primarily need to consider the columns:
# RNA_ID: sample names. Matches the column names in expr
# ORR: Objective response rate. CRPR: complete response/partial response, PR: partial response, SD: incomplete response/stable disease, PD: progressive disease, NE: inevaluable

## First, get pheno and expression profiles organized to match
expr <- expr[, match(pheno$RNA_ID, colnames(expr))]


## In these samples, collect or quantify the following:


## Tumor mutational burden
TMB <- pheno$TMB_Counts

## PD-1 expression
PD1 <- as.matrix(expr)[rownames(expr) == "PDCD1",]


## T cell inflamed gene expression profile (in the paper, they do "weighted sum average of normalized expression"), but you can use ssGSEA on the 18 genes
TcellGEP_sig <- list(TcellGEP= c("CCL5", "CD27", "CD274", "CD276", "CD8A", "CMKLR1", "CXCL9", "CXCR6", "HLA-DQA1", "HLA-DRB1", "HLA-E", "IDO1", "LAG3", "NKG7", "PDCD1LG2", "PSMB10", "STAT1", "TIGIT"))
TcellGEP <- gsva(expr = as.matrix(expr), gset.idx.list = TcellGEP_sig, method = "ssgsea")


## Does this correlate with ORR?
df <- data.frame(TMB = TMB, PD1 = PD1, TcellGEP = as.vector(TcellGEP), ORR = pheno$ORR, PFS = pheno$PFS)
ggplot(df, aes(x = ORR, y = TMB, group = ORR)) + 
  geom_boxplot() +
  ggtitle("TMB")
ggplot(df, aes(x = ORR, y = PD1, group = ORR)) + 
  geom_boxplot() +
  ggtitle("PD1")
ggplot(df, aes(x = ORR, y = TcellGEP, group = ORR)) + 
  geom_boxplot() +
  ggtitle("TcellGEP")


## What about progression free survival (PFS)?
ggplot(df, aes(x = PFS, y = TMB)) +
  geom_point() +
  geom_smooth(method="lm") +
  ggtitle("TMB")
ggplot(df, aes(x = PFS, y = PD1)) +
  geom_point() +
  geom_smooth(method="lm") +
  ggtitle("PD1")
ggplot(df, aes(x = PFS, y = TcellGEP)) +
  geom_point() +
  geom_smooth(method="lm") +
  ggtitle("TcellGEP")


## Estimate tumor purity in each sample and examine the range. How would you anticipate this to affect the three metrics, tumor mutational burden, PD-1 expression, and T cell inflamed gene expression profile?

# Function for ESTIMATE
ESTIMATE_fun <- function(expr) {
  write.table(as.data.frame(expr), file = "rma.data.gct", quote = F, sep = '\t')
  
  filterCommonGenes(input.f = 'rma.data.gct', output.f = "RMA_10412.gct", id = "GeneSymbol")
  estimateScore("RMA_10412.gct", "estimate_score.gct", platform = "affymetrix")
  estimate <- read.table("estimate_score.gct", sep = '\t', row.names = 1, header = T, skip = 2); estimate <- estimate[,-1]
  
  # Clean up
  file.remove("rma.data.gct"); 
  file.remove("RMA_10412.gct"); 
  file.remove("estimate_score.gct")
  
  # Process scores for easy output
  estimate_scores <- as.data.frame(t(estimate))
  colnames(estimate_scores) <- c('StromalSignature', 'ImmuneSignature', 'ESTIMATEScore', 'TumorPurity')
  
  return(estimate_scores)
}

estimate_scores <- ESTIMATE_fun(expr)
df$purity <- estimate_scores$TumorPurity

ggplot(df, aes(x = purity, y = TMB)) +
  geom_point() +
  geom_smooth(method="lm") +
  ggtitle("TMB")
ggplot(df, aes(x = purity, y = PD1)) +
  geom_point() +
  geom_smooth(method="lm") +
  ggtitle("PD1")
ggplot(df, aes(x = purity, y = TcellGEP)) +
  geom_point() +
  geom_smooth(method="lm") +
  ggtitle("TcellGEP")
Self-evaluation quiz (click to expand) Multiple choice quiz for todays curriculum. NOTE This is a subset of what you need to know about these topics. It is not a comprehensive list of what you need to know!

1. Which of the following is a commonly targeted immune checkpoint in cancer therapy?
a) PD-1
b) HER2
c) EGFR
d) BCR-ABL

2. How do immune checkpoint inhibitors work in cancer therapy?
a) They directly kill cancer cells.
b) They release the “brakes” of the immune cells and stimulate them to recognize and attack cancer cells.
c) They prevent the formation of blood vessels that supply nutrients to tumors.
d) They interfere with the division and growth of cancer cells.

3. What is one mechanism of resistance to immune checkpoint inhibitor therapy in cancer?
a) Downregulation of immune checkpoint proteins (targets of the therapy)
b) Activation of tumor suppressor genes
c) Enhanced immune response against cancer cells
d) Mutation of oncogenes

4. Which of the following factors can be used to predict whether a patient will respond to immune checkpoint inhibitor therapy?
a) Expression levels of immune checkpoint proteins in tumor cells
b) The inflammatory environment in the tumor
c) Tumor mutational burden
d) All of the above

5. What is the tumor microenvironment?
a) The size and location of the tumor within the body
b) The genetic composition of cancer cells within the tumor
c) The surrounding cellular and non-cellular components in the tumor’s vicinity
d) The stage of cancer progression and metastasis

6. What is tumor heterogeneity?
a) The ability of a tumor to metastasize to distant organs
b) The genetic stability of cancer cells within a tumor
c) The presence of diverse cell populations with distinct genetic and phenotypic characteristics within a tumor
d) The responsiveness of a tumor to chemotherapy or radiation therapy

7. What is tumor mutational burden (TMB)?
a) The size of the tumor within the body
b) The number of mutations found in a tumor’s DNA
c) The rate at which a tumor is growing
d) The presence of specific genetic mutations in tumor cells

8. Why is TMB relevant in the context of checkpoint inhibitor therapy?
a) It determines the potential number of mutated proteins in the tumor that the immune system can recognize as foreign
b) Highly mutating tumors are more unstable and easier to kill
c) It is the target of checkpoint inhibitor antibodies
d) All of the above

9. What is the purpose of the R package “ESTIMATE”?
a) To perform statistical analysis of gene expression data
b) To estimate the immune and stromal components of tumor microenvironment
c) To visualize the genetic mutations in cancer samples
d) To predict patient response to immunotherapy

10. How was the ESTIMATE prediction algorithm trained?
a) By analyzing patient survival data and correlating it with gene expression patterns
b) By correlating gene expression profiles with known immune and stromal content
c) By manually annotating gene expression profiles with immune and stromal scores
d) Using a reference set of genes associated with immune and stromal components in cancer samples

Self-evaluation quiz answers (click to expand) a, b, a, d, c, c, b, a, b, b

Curriculum for Day 5 Chimeric antigen receptor therapy
This figure outlining what constitutes a good target for chimeric antigen receptor therapy
Paper about the role of alternative splicing in therapy-induced target-loss
Your turn! Figure out what the following bioinformatics tools can be used for and how they work (in very broad strokes, not the math):
* DeepLoc 2.0
* TopCons2
* BepiPred 3.0


Day 5: Computational precision therapeutics: evaluating targets for chimeric antigen receptor therapy

When: June 8, 9.00-17.00

Where: Building 210, room 002

What: Follow-up on exercises, Q&A, Lecture, exercises, and self study

Program 9.00 - 10.30: Follow-up on day 4 exercises, Q&A, Lecture: Online bioinformatics tools for assessing CAR therapy targets (slides)
10.30 - 12.00: Exercise: Assessment of chimeric antigen receptor therapy targets
Rest of the day: study for the test tomorrow
Today’s learning objectives After today you will be able to:
* Understand the basic concept of splice variants/protein isoforms
* Understand the general concept of chimeric antigen receptor (CAR) therapy
* Understand the features of a good CAR target, including:
      * Target expression on tumor cells and not on healthy cells
      * Target subcellular localization
      * Target epitopes
      * Target isoforms
* Understand what DeepLoc2 does
* Understand what TopCons2 does
* Understand what BepiPred 3.0 does

Exercises

In this exercise, you will use the online bioinformatics tools to evaluate whether the protein mesothelin (MSLN) is a suitable target for CAR cell therapy of hepatocellular carcinoma You will need to manually keep track of everything you find using these tools to give you final recommendation regarding its suitability.

Task 1: Extract the sequence of the canonical protein from UniProt

Task 2: Investigate the expression of MSLN in hepatocellular carcinoma and in all healthy tissues using Xenabrowser
Q: In what tissues is the gene recorded expressed?

Task 3: Predict the subcellular location of the protein using DeepLoc
Q: Where does the protein localize to in the cell?

Task 4: Predict the membrane topology of MSLN using TopCons2
Q: Does the protein (provided that it localizes to the outer cell membrane) appear to have a reasonably sized ectodomain for targeting using CARs/antibodies?

Task 5: Predict whether the ectodomain(s) of the protein are likely to harbor structural epitopes using BepiPred. When you have done this, isolate the prediction pertaining to the predicted ectodomains
Q: Are there any predicted epitopes in the ectodomain(s)?
Q: Why do you think that we isolate the ectodomain after predicting epitopes and not before?

Task 6 Predict phosphorylation sites of MSLN using NetPhosP
Q: Are there any phosphorylation sites in the ectodomain?

Task 7 Check back with the UniProt entry for MSLN
Q: Are any of your predicted features experimentally validated?

Task 8 Isolate a promising ectodomain target from the sequence of MSLN and use BLAST to see if the target sequence is found in other human proteins
Q: Do any other human proteins have similar ectodomains that could lead to potential off-target effects?
Q: If yes, do these domains also localize to the membrane?

Task 9 Check if MSLN is expressed in different isoforms
Q: If yes, do all isoforms express the ectodomain and are they equally suitable on all other features? Q: Would you recommend MSLN as a good target for hepatocellular carcinoma? Why? Why not?

Exercise answers
Self-evaluation quiz (click to expand) Multiple choice quiz for todays curriculum. NOTE This is a subset of what you need to know about these topics. It is not a comprehensive list of what you need to know!

1. How does chimeric antigen receptor (CAR) therapy work?
a) CARs are synthetic receptors that directly kill cancer cells by inducing apoptosis.
b) CARs are engineered receptors that bind to specific antigens on cancer cells and activate the immune system to attack them.
c) CARs stimulate the production of antibodies that target cancer cells.
d) CARs directly inhibit the growth and division of cancer cells.

2. What is alternative splicing?
a) The process by which genes are duplicated within a genome
b) The modification of DNA sequences by adding or removing specific nucleotides
c) The process by which mRNA is generated from DNA, involving the removal of introns and joining of exons in different combinations
d) The mechanism by which proteins are synthesized from mRNA templates

3. What are protein isoforms?
a) Different types of proteins that are found in different cellular compartments
b) Proteins that are produced through alternative splicing of mRNA transcripts
c) Proteins that are modified post-translationally by adding or removing specific functional groups
d) Proteins that are involved in the process of protein synthesis

4. What are B cell epitopes?
a) Specific regions on antibodies that bind to antigens
b) Small molecules that activate B cells to produce antibodies
c) Antigens that are recognized and bound by T cells
d) Protein fragments that are recognized and bound by specific antibodies

5. Why is CAR therapy targeting CD19 effective despite CD19 being expressed on normal cells?
a) CAR therapy does kills healthy cells expressing CD19, primarily B cells, the loss of which can be managed with immunoglobulin replacement therapy.
b) Normal cells expressing CD19 have mechanisms to evade CAR-mediated killing
c) CAR therapy is designed to spare normal cells by modulating the binding affinity to CD19
d) Normal cells expressing CD19 are replenished through the body’s natural regenerative processes throughout therapy

6. Why does CAR therapy resistance occur?
a) Low PD1 expression
b) Low T cell inflamed gene expression signature
c) Low tumor mutational burden
d) The tumor cells expressing a splice variant of CD19 arise from selective pressure. This variant does not contain the binding sit for the common CD19 antibody

7. How can we estimate protein isoform expression in tumor samples?
a) Microarray data
b) Whole genome sequencing
c) Single cell ATAC sequencing
d) RNA sequencing

8. Why is it necessary for CAR targets to have an ectodomain?
a) Ectodomains are the only place on a protein that an antibody can bind
b) Ectodomains are specific to tumor expressed proteins
c) Ectodomains are located on the outside of cell membranes, visible to CAR cells
d) All of the above

9. How does the TopCons2 tool work?
a) It is a consensus model, aggregating results from multiple other tools
b) It is based on ssGSEA
c) It is maximum likelihood model
d) It is a deep learning model

Self-evaluation quiz answers (click to expand) b, c, b, d, a, d, d, c, a

Day 6: Multiple choice test + introduction to project work

When: June 9

Where: Building 210, room 002

What: Written exam, lecture

Program 9.00 - 10.00: Multiple choice test, no aids - PLEASE BE ON TIME!
10.30 - 11.00: Introduction to project work
11.00 - 12.00: Group formation
Rest of the day: start project work


Day 7-14: Project work

When: June 12 - June 21

Where: Where ever you want!

What: Project work

Office hours I am available in building 210, room 002 during the following time slots:
12/6: 11.00 - 12.00
13/6: 11.00 - 12.00
14/6: 11.00 - 12.00
19/6: 11.00 - 12.00
21/6: 11.00 - 12.00


Day 15: Oral presentations

When: June 22

Where: TBA

What: Exam (details TBA)

Program TBA