March 2nd 2020

Agenda

  • 08.00 - 08.30 Recap of exercises from last class
  • 08.30 - 09.00 Introduction to Scripting in a Reproducible and Collaborative Framework using GitHub via RStudio
  • 09.00 - 12.00 Exercises in modelling, dimension reduction and clustering

Scripting in a Reproducible and Collaborative Framework using GitHub via RStudio

Recall the project organisation visualisation

An R-script

  • Up until now, we have been working in Rmarkdown
  • However, in a High-Performance Computing (HPC) environment, you may want to use scripting instead
  • Think of an R-script as an Rmarkdown document condensed to include only its code chunks
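
To make the "only the code chunks" idea concrete, here is a minimal sketch that turns a small Rmarkdown document into an R-script using `knitr::purl()`. It assumes the knitr package is installed. The example document and the temporary file names are hypothetical and only serve as illustration; this is one possible way to do the conversion, not a required step of the course workflow.

```r
# Minimal sketch (assumes the knitr package is installed)
library("knitr")

# Build a tiny, hypothetical Rmarkdown document in a temporary file.
# The chunk fence is constructed with strrep() to keep this example readable.
fence <- strrep("`", 3)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(text = c("# My analysis",
                    "",
                    "Some explanatory text.",
                    "",
                    paste0(fence, "{r}"),
                    "x <- 1 + 1",
                    fence),
           con = rmd_file)

# Extract the code chunks into an R-script.
# documentation = 0 keeps the code only and drops the surrounding text.
r_file <- tempfile(fileext = ".R")
purl(input = rmd_file, output = r_file, documentation = 0, quiet = TRUE)

# Inspect the resulting script
script_lines <- readLines(r_file)
print(script_lines)
```

The resulting file holds the chunk code without the explanatory text, so it can be run non-interactively, which is typically what you want on an HPC system.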

An R-script

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
foo <- function(x){ return("bar") }

# Load data
# ------------------------------------------------------------------------------
my_data <- read_tsv(file = "path/to/my/data.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_subset <- my_data # %>% ...

# Visualise
# ------------------------------------------------------------------------------
pl1 <- my_data_subset %>% ggplot(aes(x = var_1, y = var_2)) + geom_point()

# Write data
# ------------------------------------------------------------------------------
ggsave(filename = "path/to/my/results/plot.png", plot = pl1)
write_tsv(x = my_data_subset, file = "path/to/my/data_subset.tsv") # readr >= 1.4.0 uses 'file'; 'path' is deprecated

git

  • git is the industry standard for code collaboration
  • It is used in most industry settings with an established data science infrastructure

git

Questions?