Intermezzo

Say you wanted to use this plot in a public presentation, but copyright restrictions did not allow it. Perhaps you could re-create it using ggplot?

T1: Spend ~10-15 minutes (no more!) with your desk buddy discussing and re-creating the overall components of the above plot (source and info). It is okay to hardcode here; I will show you the full code for doing this next time.

How to organise a project

Recall this figure? It is inspired by Josh Reich’s Load-clean-func-do thought and this 2009 paper by William Stafford Noble.

Let us use it as a point of reference in the following, for now excluding the doc part, where results are collected in an rmarkdown document.

Setup Project Directory

Log on to your RStudio Cloud session
Create a BRAND NEW project, i.e. do not do these exercises in your session with all your rmarkdown documents
Note that this also means you will have to reinstall tidyverse and any other packages you may need
T2: In your new project, create the below directory structure

project_root
  |
  + data
    |
    + _raw
  |
  + doc
  |
  + R
  |
  + results
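
If you prefer the console over the Files pane, the structure above can be sketched in base R as follows. This is a minimal sketch that assumes your working directory is the project root; the variable name project_dirs is only illustrative.

```r
# Sketch: create the project directory layout from the R console.
# Assumption: the working directory is the project root.
project_dirs <- c("data/_raw", "doc", "R", "results")

for (d in project_dirs) {
  # recursive = TRUE also creates the parent directory "data" for "data/_raw";
  # showWarnings = FALSE keeps the call quiet if a directory already exists
  dir.create(path = d, recursive = TRUE, showWarnings = FALSE)
}

# Check that all directories are now present
dir.exists(project_dirs)
```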

Setup Project Script Files

Now, we need to add some files. Note that for now we will leave rmarkdown and instead turn to scripting. Think of a script as a condensed rmarkdown document that contains only code. This format is useful for execution on HPC systems. The following is a general (recommended) layout of an R script. Note how anything that is not code has to be preceded by a #

General (recommended) layout of an R-script

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
foo <- function(x){
  return("bar")
}

# Load data
# ------------------------------------------------------------------------------
my_data <- read_tsv(file = "path/to/my/data.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_subset <- my_data %>% 
  filter(...) %>% 
  select(...) %>% 
  mutate(...) %>% 
  arrange(...)

# Visualise
# ------------------------------------------------------------------------------
pl1 <- my_data_subset %>% 
  ggplot(aes(x = var_1, y = var_2)) +
  geom_point() +
  theme_bw()

# Write data
# ------------------------------------------------------------------------------
ggsave(filename = "path/to/my/results/plot.png",
       plot = pl1,
       width = 10,
       height = 6)
write_tsv(x = my_data_subset,
          file = "path/to/my/data_subset.tsv")

With respect to the formatting of scripts, it is highly recommended to follow the tidyverse style guide.

T3: Add the following R script files to your project (note that the file suffix is .R and not .Rmd)

project_root
  |
  + data
    |
    + _raw
  |
  + doc
  |
  + R
    + 00_doit.R
    + 01_load.R
    + 02_clean.R
    + 03_augment.R
    + 04_analysis_i.R
    + 99_project_functions.R
  |
  + results
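
The empty script files can also be created from the console. The following is a minimal sketch in base R, assuming the working directory is the project root; the variable names are illustrative only.

```r
# Sketch: create the empty script files listed above.
# Assumption: the working directory is the project root.
script_names <- c("00_doit.R", "01_load.R", "02_clean.R", "03_augment.R",
                  "04_analysis_i.R", "99_project_functions.R")
script_paths <- file.path("R", script_names)

# Make sure the R directory exists
dir.create(path = "R", showWarnings = FALSE)

# file.create() truncates existing files, so only create the missing ones
missing_scripts <- script_paths[!file.exists(script_paths)]
invisible(file.create(missing_scripts))

# Check that all script files are now present
file.exists(script_paths)
```

Skipping files that already exist means the snippet can be re-run without wiping scripts you have started to fill in.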

T4: Now, add the following content to each script

00_doit.R

# Run all scripts
# ------------------------------------------------------------------------------
source(file = "R/01_load.R")
source(file = "R/02_clean.R")
source(file = "R/03_augment.R")
source(file = "R/04_analysis_i.R")

01_load.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data_raw <- read_tsv(file = "data/_raw/my_raw_data.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data <- my_data_raw # %>% ...

# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data,
          file = "data/01_my_data.tsv")

02_clean.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data <- read_tsv(file = "data/01_my_data.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean <- my_data # %>% ...

# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data_clean,
          file = "data/02_my_data_clean.tsv")

03_augment.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data_clean <- read_tsv(file = "data/02_my_data_clean.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean_aug <- my_data_clean # %>% ...

# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data_clean_aug,
          file = "data/03_my_data_clean_aug.tsv")

04_analysis_i.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data_clean_aug <- read_tsv(file = "data/03_my_data_clean_aug.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean_aug %>% ...

# Model data
# ------------------------------------------------------------------------------
my_data_clean_aug %>% ...

# Visualise data
# ------------------------------------------------------------------------------
my_data_clean_aug %>% ...

# Write data
# ------------------------------------------------------------------------------
write_tsv(...)
ggsave(...)

99_project_functions.R

# Define project functions
# ------------------------------------------------------------------------------
foo <- function(x){
  return(2*x)
}
bar <- function(x){
  return(x^2)
}
...
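
As a quick check of how such project functions behave once they are available in a script, here is a small sketch. In the framework above the definitions would come from source(file = "R/99_project_functions.R"); they are repeated here only so that the snippet runs on its own.

```r
# Definitions repeated from 99_project_functions.R so the snippet is
# self-contained
foo <- function(x){
  return(2*x)
}
bar <- function(x){
  return(x^2)
}

# Both functions are vectorised, so they also work on whole columns
foo(3)         # 6
bar(3)         # 9
bar(foo(1:3))  # 4 16 36
```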

Now, this is as generalised as possible. The next step is to add some data to "data/_raw/" and then commence setting up your analysis.

T5: In the last exercise from week 5, you were supposed to make a PCA/clustering analysis. Now, refit that code into the above framework (gravier can be found here)

When the project has been set up correctly, you should be able to delete everything except the raw data and your script files (see the visualisation at the start of these exercises) and then run the entire project by opening 00_doit.R and sourcing it with cmd/ctrl+shift+s. You can also run the project from the command line, which allows automation and integration into an HPC environment. Simply switch to the Terminal tab in the console pane and issue the command Rscript R/00_doit.R. Remember that you may need to know the absolute location of Rscript in an HPC environment; you can get this by running which Rscript in the terminal.
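
The delete step can be sketched in base R as below. This is only a sketch under the assumption that every file directly inside data/ and results/ is derived output and that the raw data lives in data/_raw; the helper name list_derived_files is hypothetical and not part of the course material.

```r
# Sketch: list the derived files that can be deleted before a full re-run.
# Assumption: files directly inside data/ and results/ are derived output,
# while raw data lives in the sub-directory data/_raw.
list_derived_files <- function(root = ".") {
  candidates <- c(
    list.files(path = file.path(root, "data"), full.names = TRUE),
    list.files(path = file.path(root, "results"), full.names = TRUE)
  )
  # Drop directories (such as data/_raw) so that only files remain
  candidates[!dir.exists(candidates)]
}

derived <- list_derived_files()
print(derived)  # inspect the list before deleting anything

# Uncomment once the list looks right:
# file.remove(derived)
# source(file = "R/00_doit.R")
```

The deletion and the re-run are left commented out on purpose, so that nothing is removed before you have checked the list.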

T6: Make sure you can run 00_doit.R, i.e. your entire project, without errors, then delete and re-run as described above.

Using RStudio with git

The following will be quite cookbook-like, but stay tuned! I can guarantee you that if you get into industry as a data scientist, you will encounter git!

Getting started

If you go to this GitHub site, you will find the official R-for-Bio-Data-Science GitHub site. Before we continue, if you do not have a GitHub account, go create one. Then send me your GitHub username in a personal Slack message. I will then invite you to the organisation.

Connecting RStudio Cloud with git

Go to the R-for-Bio-Data-Science GitHub site
Click the green New button in the upper right corner
Name your repository like so: 2020_your_github_username (Make sure, that the owner of the repo is rforbiodatascience and not your own GitHub profile)
Leave anything else as is (default)
Click Create repository at the bottom of the page
Now, leave the page you are at open
Return to your RStudio Cloud session and the project framework you worked with previously
Find the console pane and click the Terminal tab; this will present you with a full shell
Run each line from the code box below separately by copy/pasting it into the Terminal; be sure to exchange my example credentials for your own
Now return to the GitHub page; at the top it will say rforbiodatascience / 2020_username, click on 2020_username
Inspect your new GitHub repository, which should hold one file, README.md, and the header 2020_username
If so, congratulations, you have now successfully connected your RStudio Cloud session with your GitHub repository
Return to your RStudio Cloud session
Now, we need to restart the session for the changes to take effect. In the upper right corner, your name will be in grey, and to the left of that are three dots in a circle. Click on the dots
Click Relaunch Project
Click OK
In the environment pane, you should now see a new tab, Git, and in your files pane, you should see a new file appear: .gitignore. We will get back to both of these later

# Codebox for setting up git:
git init
git config user.email "your_github_email_address"
git config user.name "your_github_user_name"
echo "# 2020_your_github_user_name" > README.md
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/rforbiodatascience/2020_your_github_user_name.git
git push -u origin master
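# You will now be prompted for your credentials (use your own user/pass):
# Username for 'https://github.com': your_github_user_name
# Password for 'https://your_github_user_name@github.com':
# Note: GitHub no longer accepts account passwords for git over HTTPS;
# enter a personal access token at the password prompt instead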

If you get the error error: cannot run rpostback-askpass: No such file or directory, ignore it, and make sure that your email, username and password match your GitHub credentials.

Adding content to GitHub using `git`

Click the green plus in the upper left corner and choose R Markdown...
It will prompt you to install a bunch of packages, just click Yes
You will then see the dialogue New R Markdown, fill in the title Reproducible Data Analysis Framework Exercise and author and click OK
In the header of the rmarkdown-document you just created, make sure to change output: html_document to output: github_document
Start by saving the document as README.Rmd
Now, delete everything below ## R Markdown and replace the auto-header ## R Markdown with ## Description
Below ## Description add a few lines to describe what is in the repository
Knit the document
In your RStudio Cloud session, find the Environment pane. Here, you should have a new tab you have not seen before, Git. Go ahead and click it
Now you will see what is called the staging area
Find the button saying Commit just above the list of files and click it
A new dialogue opens and now, you will see the files in your project directory, which are different from your files in your github repo
Find README.md; it should have a blue M to the left of it, and further to the left, under Staged, there is a tick-box. Tick it
Inspect the changes below: “old” is in red and “new” is in green
In the box saying Commit message write e.g. Update README and click the button Commit just below the box
This will yield some Git Commit messages, just click Close
In the upper right corner, click pull, this should result in a Git Pull message saying Already up-to-date, click Close once again (pull = pull stuff down from the GitHub server)
In the upper right corner, click push and enter your username and password. This should result in a Git Push message saying something along the lines of HEAD -> master, click Close once again (push = push stuff up to the GitHub server)
Close the dialogue window
Close the commit window
Go to your GitHub repo and refresh the page, now you should see the new README you created
Go back to your RStudio Cloud session and find the file .gitignore. Click it and add the lines .gitignore and project.Rproj. When you save this file, you should see the files corresponding to the lines you entered disappear from the staging area (this is how you control which files should end up in your GitHub repo)
Now, add ALL the files in the staging area (either use the gui or use the terminal and the command git add .)
Commit the files with the message first full project commit and then repeat the aforementioned pull/push
Go to your GitHub repo and check, that everything went as planned
Go back to RStudio, choose one of your files and change something (e.g. add a comment), then pull/push and again check if you can see your change in your GitHub repo.

Warning: Git merge conflicts can get complicated; one remedy is to always pull before you push.

That’s it for today!

22100 - R for Bio Data Science

Week 6 - Exercises: Scripting in a Reproducible and Collaborative Framework using GitHub via RStudio

March 9th 2020

