Intermezzo

Say you wanted to use this plot in a public presentation, but copyright did not allow it. Perhaps you could re-create it with ggplot?

  • T1: Spend ~10-15 minutes (no more!) with your desk buddy on discussing and re-creating the overall components of the above plot (source and info). It is okay to hardcode here; I will show you the full code for doing this next time.

How to organise a project

Recall this figure? It is inspired by Josh Reich’s Load-clean-func-do-thought and this 2009 paper by William Stafford Noble.

Let us use it as a point of reference in the following, for now leaving out the doc part, where everything is collected in an rmarkdown document.

Setup Project Directory

  • Log on to your RStudio Cloud session
  • Create a BRAND NEW project, i.e. do not do these exercises in your session with all your rmarkdown documents
  • Note that this also means you will have to reinstall tidyverse and any other packages you may need
  • T2: In your new project, create the directory structure below
project_root
  |
  + data
    |
    + _raw
  |
  + doc
  |
  + R
  |
  + results
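
The same structure can also be created from the Terminal tab instead of the Files pane. A minimal sketch, assuming your shell's working directory is the project root (in RStudio Cloud the Terminal normally opens there):

```shell
# Create the project folders; -p also creates parent folders (data before
# data/_raw) and does not complain if a folder already exists
mkdir -p data/_raw doc R results

# List the folders that now exist, to compare with the tree above
find data doc R results -type d | sort
```

Because of -p, running the command a second time is harmless, which makes it safe to keep in a setup script.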

Setup Project Script Files

Now, we need to add some files. Note, for now we will leave the rmarkdown and instead turn to scripting. Think of it like condensing the rmarkdown, so that it only contains code. This format is useful for execution on HPC-systems. The following is a general (recommended) layout of an R-script. Note how anything that is not code has to be preceded by a #

00_doit.R

# Run all scripts
# ------------------------------------------------------------------------------
source(file = "R/01_load.R")
source(file = "R/02_clean.R")
source(file = "R/03_augment.R")
source(file = "R/04_analysis_i.R")

01_load.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data_raw <- read_tsv(file = "data/_raw/my_raw_data.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data <- my_data_raw # %>% ...

# Write data
# ------------------------------------------------------------------------------
# Note: readr >= 1.4.0 names this argument `file`; older versions used `path`
write_tsv(x = my_data,
          file = "data/01_my_data.tsv")

02_clean.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data <- read_tsv(file = "data/01_my_data.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean <- my_data # %>% ...

# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data_clean,
          file = "data/02_my_data_clean.tsv")

03_augment.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data_clean <- read_tsv(file = "data/02_my_data_clean.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean_aug <- my_data_clean # %>% ...

# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data_clean_aug,
          file = "data/03_my_data_clean_aug.tsv")

04_analysis_i.R

# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())

# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")

# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")

# Load data
# ------------------------------------------------------------------------------
my_data_clean_aug <- read_tsv(file = "data/03_my_data_clean_aug.tsv")

# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean_aug # %>% ...

# Model data
# ------------------------------------------------------------------------------
my_data_clean_aug # %>% ...

# Visualise data
# ------------------------------------------------------------------------------
my_data_clean_aug # %>% ...

# Write data
# ------------------------------------------------------------------------------
# write_tsv(...)
# ggsave(...)

99_project_functions.R

# Define project functions
# ------------------------------------------------------------------------------
foo <- function(x){
  return(2*x)
}
bar <- function(x){
  return(x^2)
}
# ...

Now, this is as generalised as possible. The next step is to fill in some data in "data/_raw/" and then commence setting up your analysis

  • T5: In the last exercise from week 5, you were supposed to make a PCA/clustering analysis. Now, refit that code into the above framework (gravier can be found here)

When the project has been set up correctly, you should be able to delete everything except the raw data and your script files (see the visualisation at the start of these exercises) and then run the entire project by opening 00_doit.R and sourcing it with cmd/ctrl+shift+s. You can also run it from the command line, which allows automation and integration into an HPC environment: simply switch to the Terminal tab in the console pane and issue the command Rscript R/00_doit.R. Remember that you may need to know the absolute location of Rscript in an HPC environment; you can get this by running which Rscript in the terminal.

  • T6: Make sure you can run 00_doit.R i.e. your entire project without errors, then delete and re-run as described above.
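
The delete-and-re-run check can be rehearsed from the Terminal first. The sketch below is an illustration only: it builds a throwaway folder called demo_project (a hypothetical name) with empty placeholder files, so that nothing in your real project is touched. It assumes that generated data live directly in data/ and generated output in results/, as in the script templates above; the file name 04_plot.png is made up for the example.

```shell
# Build a throwaway project with empty placeholder files (all names are examples)
mkdir -p demo_project/data/_raw demo_project/R demo_project/results
touch demo_project/data/_raw/my_raw_data.tsv   # raw data: must survive
touch demo_project/R/00_doit.R                 # script: must survive
touch demo_project/data/01_my_data.tsv         # generated: should be deleted
touch demo_project/results/04_plot.png         # generated: should be deleted

# Delete generated files: those directly in data/ (not in data/_raw/) ...
find demo_project/data -maxdepth 1 -type f -delete
# ... and everything in results/
find demo_project/results -type f -delete

# What is left should be the raw data and the scripts only
find demo_project -type f | sort

# In your real project you would now rebuild everything from the project root:
# Rscript R/00_doit.R
```

Restricting the first find to -maxdepth 1 is what protects data/_raw/: only files sitting directly in data/ are removed.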

Using RStudio with git

The following will be quite cook-book like, but stay tuned! I can guarantee you that if you get into industry as a data scientist, you will encounter git!

Getting started

If you go to this GitHub site, you will find the official R-for-Bio-Data-Science GitHub site. Before we continue, if you do not have a GitHub account, go create your own account. Then send me your GitHub username in a personal Slack message. I will then invite you to the organisation.

Connecting RStudio Cloud with git

  1. Go to the R-for-Bio-Data-Science GitHub site
  2. Click the green New button in the upper right corner
  3. Name your repository like so: 2020_your_github_username (Make sure, that the owner of the repo is rforbiodatascience and not your own GitHub profile)
  4. Leave anything else as is (default)
  5. Click Create repository at the bottom of the page
  6. Now, leave the page you are at open
  7. Return to your RStudio Cloud session and the project framework you worked with previously
  8. Find the console pane and click the Terminal tab; this will present you with a full shell
  9. Run each line from the code box below separately by copy/pasting it into the Terminal. Be sure to exchange my example credentials for your own
  10. Now return to the GitHub page. At the top it will say rforbiodatascience / 2020_username; click on 2020_username
  11. Inspect your new GitHub repository, which should hold one file, README.md, and the header 2020_username
  12. If so, congratulations: you have now successfully connected your RStudio Cloud session with your GitHub repository
  13. Return to your RStudio Cloud session
  14. Now we need to restart the session for the changes to take effect. In the upper right corner your name will be in grey, and to the left of it are three dots in a circle; click on the dots
  15. Click Relaunch Project
  16. Click OK
  17. In the environment pane, you should now see a new tab, Git, and in your files pane a new file, .gitignore; we will get back to both of these later
# Codebox for setting up git:
git init
git config user.email "your_github_email_address"
git config user.name "your_github_user_name"
echo "# 2020_your_github_user_name" > README.md
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/rforbiodatascience/2020_your_github_user_name.git
git push -u origin master
# After the push command you will be prompted for your GitHub credentials:
#   Username for 'https://github.com': your_github_user_name
#   Password for 'https://your_github_user_name@github.com': (use your own)

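If you get the error error: cannot run rpostback-askpass: No such file or directory, ignore it, and make sure that your email, username and password match your GitHub credentials. Note that GitHub has since August 2021 stopped accepting account passwords for git operations over HTTPS; if your password is rejected, enter a personal access token in its place.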
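
To check that the configuration took effect, you can ask git to print it back. A minimal sketch, run here in a scratch repository called demo_setup (a hypothetical name) with the same placeholder values as in the code box, so that it is safe to try; in your own project only the last three git commands are needed, run from the project root.

```shell
# Scratch repository with placeholder settings (replace with your own values)
git init --quiet demo_setup
(
  cd demo_setup
  git config user.email "your_github_email_address"
  git config user.name "your_github_user_name"
  git remote add origin https://github.com/rforbiodatascience/2020_your_github_user_name.git

  # Print the settings back; they should match your GitHub credentials
  git config user.name
  git config user.email

  # Show the address that push and pull will use
  git remote -v
)
```

If the remote address contains a typo, it can be corrected with git remote set-url origin followed by the right address.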

Adding content to GitHub using git

  1. Click the green plus in the upper left corner and choose R Markdown...
  2. It will prompt you to install a bunch of packages, just click Yes
  3. You will then see the dialogue New R Markdown, fill in the title Reproducible Data Analysis Framework Exercise and author and click OK
  4. In the header of the rmarkdown-document you just created, make sure to change output: html_document to output: github_document
  5. Start by saving the document as README.Rmd
  6. Now, delete everything below ## R Markdown and replace the auto-header ## R Markdown with ## Description
  7. Below ## Description add a few lines to describe what is in the repository
  8. Knit the document
  9. In your RStudio Cloud session, find the Environment pane. Here you should have a tab you have not seen before, Git; go ahead and click it
  10. Now you will see what is called the staging area
  11. Find the button saying Commit just above the list of files and click it
  12. A new dialogue opens and now, you will see the files in your project directory, which are different from your files in your github repo
  13. Find README.md; it should have a blue M to the left of it. Further to the left, under Staged, there is a tick-box; tick it
  14. Inspect the changes below: “old” is in red and “new” is in green
  15. In the box saying Commit message write e.g. Update README and click the button Commit just below the box
  16. This will yield some Git Commit messages, just click Close
  17. In the upper right corner, click Pull. This should result in a Git Pull message saying Already up-to-date; click Close once again (pull = pull stuff down from the GitHub server)
  18. In the upper right corner, click Push and enter your username and password. This should result in a Git Push message saying something along the lines of HEAD -> master; click Close once again (push = push stuff up to the GitHub server)
  19. Close the dialogue window
  20. Close the commit window
  21. Go to your GitHub repo and refresh the page, now you should see the new README you created
  22. Go back to your RStudio Cloud session and find the file .gitignore. Click it and add the lines .gitignore and project.Rproj. When you save this file, you should see the files corresponding to the lines you entered disappear from the staging area (this is how you control which files should end up in your GitHub repo)
  23. Now, add ALL the files in the staging area (either use the GUI or use the terminal and the command git add .)
  24. Commit the files with the message first full project commit and then repeat the aforementioned pull/push
  25. Go to your GitHub repo and check, that everything went as planned
  26. Go back to RStudio, choose one of your files and change something (e.g. add a comment), then commit and pull/push, and again check that you can see your change in your GitHub repo.
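
The effect of step 22 can also be checked from the Terminal. A minimal sketch in a scratch repository (demo_ignore is a hypothetical name, and the files are empty placeholders), using the two lines from step 22:

```shell
git init --quiet demo_ignore
(
  cd demo_ignore
  touch project.Rproj README.md            # empty placeholder files

  # Add the two ignore rules from step 22
  printf '%s\n' ".gitignore" "project.Rproj" >> .gitignore

  # Only README.md is still listed as untracked (??)
  git status --short

  # Ask git which rule makes it ignore project.Rproj
  git check-ignore -v project.Rproj
)
```

Keep in mind that ignore rules only affect untracked files; a file that has already been committed stays in the repository until it is removed from it.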

Warning: Git merge conflicts can get complicated, one remedy is to always pull before you push
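
The commit, pull and push cycle from the steps above has a direct Terminal equivalent. The sketch below is a self-contained rehearsal under stated assumptions: a local bare repository, demo_remote.git, stands in for GitHub, and demo_work plays the role of your RStudio project (both names, and the e-mail address, are hypothetical). In your real project only the commands from git status onwards are needed.

```shell
set -e                                      # stop at the first error

# Stand-in for the GitHub repository (practice only, no network needed)
git init --quiet --bare demo_remote.git
git init --quiet demo_work
(
  cd demo_work
  git config user.email "you@example.com"
  git config user.name "your_github_user_name"
  git remote add origin ../demo_remote.git

  # First commit and push, as in the setup code box
  echo "# demo" > README.md
  git add README.md
  git commit --quiet -m "first commit"
  git push --quiet -u origin HEAD           # HEAD = whatever the current branch is called

  # Change a file, then: inspect, stage, commit, pull, push
  echo "## Description" >> README.md
  git status --short                        # modified files are marked with M
  git add README.md
  git commit --quiet -m "Update README"
  git pull --quiet                          # always pull before you push
  git push --quiet

  git log --oneline                         # both commits, newest first
)
```

Pulling first means that any changes made by others are merged on your machine, where you can resolve conflicts, before your own commits are sent to the server.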

That’s it for today!