First of all, make sure to read every line in these exercises carefully!
For these exercises, work in pairs within your desk groups. If you get stuck, revisit the chapter on data visualisation you prepared for today or practise your googling skills. I will of course also be circulating and available for questions.
First, go to your RStudio Cloud session from last time, log in and choose the project you created.
Recall the layout of the IDE (Integrated Development Environment)
Then, before we start, we need to install the packages we need, so without further ado, run each of the following lines separately in your console:
install.packages("tidyverse")
install.packages("devtools")
library("devtools")
install_github("ramhiser/datamicroarray")
Once the installation has finished (this takes ~5 minutes), create a new rmarkdown document (File -> New File -> R Markdown…)
Recall the syntax for a new code chunk:
```{r}
# Here goes the code... Note how this part does not get executed
```
Short-cuts:
Now, in your rmarkdown document, add a new code chunk with the commands
library("tidyverse")
library("datamicroarray")
This will load our data science toolbox, including ggplot, and a set of various bio-data. You can read more about the datamicroarray here
Before we can visualise the data, we need to wrangle it a bit. Never mind the details here, we will get to that later. Just create a new chunk, copy/paste the code below and run it:
data('gravier', package = 'datamicroarray')
set.seed(676571)
cancer_data = mutate(as_tibble(pluck(gravier, "x")),
                     y = pluck(gravier, "y"),
                     pt_id = 1:length(pluck(gravier, "y")),
                     age = round(rnorm(length(pluck(gravier, "y")), mean = 55, sd = 10), 1))
cancer_data=rename(cancer_data,event_label=y)
cancer_data$age_group=cut(cancer_data$age,breaks=seq(10,100,by=10))
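For the curious, the last line above bins the numeric age into ten-year intervals with cut(). The following is a minimal sketch of that single step, assuming three hand-picked ages (the variable names ages and age_groups are made up for illustration and are not part of the exercise code):

```r
# Hypothetical ages, chosen by hand for illustration
ages <- c(34.2, 47, 60.3)

# Same breaks as in the exercise code: 10, 20, ..., 100
age_groups <- cut(ages, breaks = seq(10, 100, by = 10))

# Each age is assigned to an interval that is open on the left and closed on the right
as.character(age_groups)
# "(30,40]" "(40,50]" "(60,70]"
```

Because the intervals are closed on the right, an age of exactly 40 would fall in (30,40] and not in (40,50].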
Now we have the data set as a tibble, which is an augmented data frame (we will also get to that later):
select(slice(cancer_data,1:8),pt_id,age,age_group,event_label,1:5)
## # A tibble: 8 x 9
## pt_id age age_group event_label g2E09 g7F07 g1A01 g3C09
## <int> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 34.2 (30,40] good -0.00144 -0.00144 -0.0831 -0.0475
## 2 2 47 (40,50] good -0.0604 0.0129 -0.00144 0.0104
## 3 3 60.3 (60,70] good 0.0398 0.0524 -0.0786 0.0635
## 4 4 57.8 (50,60] good 0.0101 0.0314 -0.0218 0.0215
## 5 5 54.9 (50,60] good 0.0496 0.0201 0.0370 0.0311
## 6 6 58.8 (50,60] good -0.0664 0.0468 0.00720 -0.370
## 7 7 52.9 (50,60] good -0.00289 -0.0816 -0.0291 -0.0249
## 8 8 74.5 (70,80] good -0.198 -0.0499 -0.0634 -0.0298
## # … with 1 more variable: g3H08 <dbl>
This is just the first 8 rows and the first 9 columns, so
Hint: Where did the data come from?
Hint: Do you think you are the first person in the world to try to find out how many rows and columns are in a data set in R?
The general syntax for a basic ggplot is:
ggplot(data = my_data,
       mapping = aes(x = variable_1_name, y = variable_2_name)) +
  geom_something() +
  labs()
Note the + for adding layers to the plot
- ggplot: the plotting function
- my_data: the data you want to plot
- aes(): the mappings of your data to the plot
- x: data for the x-axis
- y: data for the y-axis
- geom_something(): the representation of your data
- labs(): the x-/y-labels, title, etc.
A very handy cheat-sheet can be found here
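To make the template concrete, here is a minimal sketch with every placeholder filled in. It assumes a small made-up data frame (toy_data, with hypothetical columns dose and response) rather than the cancer_data from the exercises:

```r
library("ggplot2")  # loaded automatically by library("tidyverse")

# Hypothetical toy data, invented for illustration only
toy_data <- data.frame(dose = c(1, 2, 3, 4, 5),
                       response = c(2.1, 3.9, 6.2, 8.1, 9.8))

# data + aesthetic mapping, then one geom layer, then labels, joined with +
p <- ggplot(data = toy_data,
            mapping = aes(x = dose, y = response)) +
  geom_point() +
  labs(x = "Dose", y = "Response", title = "Toy dose-response data")

# Printing the object draws the plot
p
```

Storing the plot in a variable such as p is optional, but it makes it easy to add further layers later with p + ...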
Remember to write notes in your rmarkdown document. You will likely revisit these basic plots in future exercises.
Primer: Plotting 2 x 20 random, normally distributed numbers can be done like so:
ggplot(data = tibble(x = rnorm(20), y = rnorm(20)),
       mapping = aes(x = x, y = y)) +
  geom_point()
Using this small primer, the materials you read for today and the cancer_data you created, in separate code-chunks, create a:
x = "my_gene" in aes())
Remember to write notes to yourself, so you know what you did and if there is something in particular you want to remember.
x and y?
T6: Pick your favourite gene and create a boxplot of expression levels stratified on the variable event_label
T7: Like T6, but with densitograms
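If the idea of stratifying a plot on a variable is unclear, the following sketch shows the general pattern on made-up data. Everything here is an assumption for illustration: toy_data, group and value are hypothetical names, and mapping the grouping variable to x (boxplot) or fill (density) is just one common way of doing it. Adapting it to your chosen gene and event_label is left to you.

```r
library("ggplot2")

set.seed(1)
# Hypothetical toy data: one made-up measurement in two made-up groups
toy_data <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = c(rnorm(50, mean = 0), rnorm(50, mean = 1))
)

# Boxplot: mapping the grouping variable to x gives one box per group
p_box <- ggplot(data = toy_data,
                mapping = aes(x = group, y = value)) +
  geom_boxplot()

# Density plot: mapping the grouping variable to fill gives one curve per group
p_dens <- ggplot(data = toy_data,
                 mapping = aes(x = value, fill = group)) +
  geom_density(alpha = 0.5)  # semi-transparent so overlapping curves stay visible

p_box
p_dens
```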
age_group
event_label
age
event_label
event_label
age_group
event_label
title of the legend
T11: Recreate the following plot
Q5: Using your biological knowledge, what is your interpretation of the plot?
T12: Recreate the following plot
Q6: Using your biological knowledge, what is your interpretation of the plot?
T13: If you arrive here and there is still time left for the exercises, you are probably already familiar with ggplot. Use what time is left to challenge yourself to further explore the cancer_data and create some nice data visualisations. Show me what you come up with!