22100 - R for Bio Data Science

Getting Started

First of all, make sure to read every line in these exercises carefully!

For these exercises, work in pairs within your desk groups. If you get stuck, revisit the chapter on data visualisation you prepared for today, or practise your googling skills. I will of course also be circulating and available for questions.

First, go to your RStudio Cloud session from last time, log in and choose the project you created.

Recall the layout of the IDE (Integrated Development Environment)

Then, before we start, we need to install the required packages, so without further ado, run each of the following lines separately in your console:

install.packages("tidyverse")
install.packages("devtools")
library("devtools")
install_github("ramhiser/datamicroarray")

Once the installation has finished (it takes around 5 minutes), create a new rmarkdown document (File -> New File -> R Markdown…)
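If you are unsure whether the installation succeeded, the following is a minimal sketch of one way to check it. It only assumes the three package names from the install commands above; the variable names `needed` and `installed` are made up for illustration.

```r
# Packages the exercises rely on (names taken from the install commands above)
needed <- c("tidyverse", "devtools", "datamicroarray")

# requireNamespace() returns TRUE if a package can be loaded and FALSE otherwise,
# without attaching the package to the search path
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: TRUE means the package is ready to use
print(installed)
```

Any package reported as FALSE most likely needs its install line to be run again.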

Recall the syntax for a new code chunk:

```{r}
# Here goes the code... Note how this part does not get executed
```

Short-cuts:

Mac: CMD + OPTION + i
Windows: CTRL + ALT + i

Now, in your rmarkdown document, add a new code chunk with the command

library("tidyverse")
library("datamicroarray")

This will load our data science toolbox, including ggplot2 and a collection of bio-data sets. You can read more about the datamicroarray package here

Create data

Before we can visualise the data, we need to wrangle it a bit. Never mind the details here; we will get to them later. Just create a new chunk, copy/paste the code below and run it:

data('gravier', package = 'datamicroarray')
set.seed(676571)
cancer_data = mutate(as_tibble(pluck(gravier, "x")),
                     y = pluck(gravier, "y"),
                     pt_id = 1:length(pluck(gravier, "y")),
                     age = round(rnorm(length(pluck(gravier, "y")), mean = 55, sd = 10), 1))
cancer_data = rename(cancer_data, event_label = y)
cancer_data$age_group = cut(cancer_data$age, breaks = seq(10, 100, by = 10))

Now we have the data set as a tibble, which is an augmented data frame (we will also get to that later):

select(slice(cancer_data,1:8),pt_id,age,age_group,event_label,1:5)

## # A tibble: 8 x 9
##   pt_id   age age_group event_label    g2E09    g7F07    g1A01   g3C09
##   <int> <dbl> <fct>     <fct>          <dbl>    <dbl>    <dbl>   <dbl>
## 1     1  34.2 (30,40]   good        -0.00144 -0.00144 -0.0831  -0.0475
## 2     2  47   (40,50]   good        -0.0604   0.0129  -0.00144  0.0104
## 3     3  60.3 (60,70]   good         0.0398   0.0524  -0.0786   0.0635
## 4     4  57.8 (50,60]   good         0.0101   0.0314  -0.0218   0.0215
## 5     5  54.9 (50,60]   good         0.0496   0.0201   0.0370   0.0311
## 6     6  58.8 (50,60]   good        -0.0664   0.0468   0.00720 -0.370 
## 7     7  52.9 (50,60]   good        -0.00289 -0.0816  -0.0291  -0.0249
## 8     8  74.5 (70,80]   good        -0.198   -0.0499  -0.0634  -0.0298
## # … with 1 more variable: g3H08 <dbl>

This is just the first 8 rows and a selection of 9 columns, so

Q1: What is this data?

Hint: Where did the data come from?

Q2: How many rows and columns are there in the data set in total?

Hint: Do you think you are the first person in the world to try to find out how many rows and columns are in a data set in R?

Q3: Which are the variables and which are the observations in relation to rows and columns?

ggplot - The Very Basics

General Syntax

The general syntax for a basic ggplot is:

ggplot(data = my_data,
       mapping = aes(x = variable_1_name, y = variable_2_name)) +
  geom_something() +
  labs()

Note the + for adding layers to the plot

ggplot: the plotting function
my_data: the data you want to plot
aes(): the mappings of your data to the plot
x: data for the x-axis
y: data for the y-axis
geom_something(): the representation of your data
labs(): the x-/y-labels, title, etc.
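To make the template concrete, here is a minimal sketch that fills in each placeholder. The data set `toy_data`, its columns `age` and `expression`, and the label texts are hypothetical and simulated for illustration only; they are not part of the cancer_data used in the exercises.

```r
library("tidyverse")

# Hypothetical toy data: 30 simulated 'patients' with an age and an expression value
set.seed(1)
toy_data <- tibble(age = rnorm(30, mean = 55, sd = 10),
                   expression = rnorm(30))

# One data set, one mapping, one geom and one set of labels, joined with +
toy_plot <- ggplot(data = toy_data,
                   mapping = aes(x = age, y = expression)) +
  geom_point() +
  labs(x = "Age (years)", y = "Expression level", title = "Toy example")

# Printing the object draws the plot
toy_plot
```

Storing the plot in a variable is optional, but it makes it easy to add further layers later with +.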

A very handy cheat-sheet can be found here

Basic Plots

Remember to write notes in your rmarkdown document. You will likely revisit these basic plots in future exercises.

Primer: Plotting 2 x 20 random, normally distributed numbers can be done like so:

ggplot(data = tibble(x = rnorm(20), y = rnorm(20)),
       mapping = aes(x = x, y = y)) +
  geom_point()

Using this small primer, the materials you read for today and the cancer_data you created, in separate code-chunks, create a:

T1: scatterplot of one variable against another
T2: line graph of one variable against another
T3: boxplot of one variable (Hint: Set x = "my_gene" in aes())
T4: histogram of one variable
T5: densitogram of one variable

Remember to write notes to yourself, so you know what you did and if there is something in particular you want to remember.

Q4: Do all geoms require both x and y?

Extending Basic Plots

T6: Pick your favourite gene and create a boxplot of expression levels stratified on the variable event_label
T7: Like T6, but with densitograms
T8: Pick your favourite gene and create a boxplot of expression levels stratified on the variable age_group
- Then, add stratification on event_label
- Then, add transparency to the boxes
- Then, add some labels
T9: Pick your favourite gene and create a scatter-plot of expression levels versus age
- Then, add stratification on event_label
- Then, add a smoothing line
- Then, add some labels
T10: Pick your favourite two genes and create a scatter-plot of their expression levels
- Then, add stratification on event_label
- Then, add a smoothing line
- Then, split the plot into separate panels based on the variable age_group
- Then, add some labels
- Then, change the legend title for event_label
T11: Recreate the following plot

Q5: Using your biological knowledge, what is your interpretation of the plot?
T12: Recreate the following plot

Q6: Using your biological knowledge, what is your interpretation of the plot?
T13: If you arrive here and there is still time left for the exercises, you are probably already familiar with ggplot. Use the remaining time to challenge yourself to explore the cancer_data further and create some nice data visualisations. Show me what you come up with!
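The building blocks mentioned in the tasks above (stratification, transparency, a smoothing line, panels and labels) can be combined as layers. The following is only a sketch on simulated data: `toy_data` and its columns `x`, `y`, `group` and `site` are hypothetical, and the choice of a linear smoother is an assumption made for illustration, not a required solution.

```r
library("tidyverse")

# Hypothetical toy data, unrelated to cancer_data: two groups measured at two sites
set.seed(2)
toy_data <- tibble(x = rnorm(80),
                   y = 0.5 * x + rnorm(80),
                   group = rep(c("a", "b"), times = 40),
                   site = rep(c("north", "south"), each = 40))

layered_plot <- ggplot(data = toy_data,
                       mapping = aes(x = x, y = y, colour = group)) +
  geom_point(alpha = 0.5) +      # alpha sets transparency (0 = invisible, 1 = opaque)
  geom_smooth(method = "lm") +   # one straight-line fit per colour group
  facet_wrap(vars(site)) +       # one panel per value of site
  labs(x = "Toy x", y = "Toy y",
       colour = "Toy group")     # colour = ... sets the legend title

layered_plot
```

Mapping a variable to colour inside aes() is what stratifies the points and the smoothing lines; for geoms with filled areas, such as boxplots, the fill aesthetic plays the same role.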

Further resources for data visualisation

A very handy cheat-sheet can be found here
So which plot to choose? Check this handy guide
Explore ways of plotting here