Say you wanted to use this in a public presentation, but copyright did not allow it. Perhaps you could use ggplot?
Recall this figure? It is inspired by Josh Reich's Load-clean-func-do idea and this 2009 paper by William Stafford Noble.
Let us use it as a point of reference in the following, leaving out the doc part (collecting everything in an rmarkdown) for now.
tidyverse and any other packages you may need
project_root
  |
  + data
  |   |
  |   + _raw
  |
  + doc
  |
  + R
  |
  + results
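The layout above can also be created from the terminal. The following is a minimal sketch, assuming a POSIX shell and that the project directory is literally named project_root, as in the tree above:

```shell
# Create the project skeleton. The -p flag creates parent directories
# as needed and does not fail if a directory already exists.
mkdir -p project_root/data/_raw \
         project_root/doc \
         project_root/R \
         project_root/results

# List the directories that now exist
find project_root -type d | sort
```

Because of -p, running the command a second time is harmless.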
Now, we need to add some files. For now, we will leave the rmarkdown and instead turn to scripting. Think of a script as a condensed rmarkdown that contains only code. This format is useful for execution on HPC systems. The following is a general (recommended) layout of an R script. Note how anything that is not code has to be preceded by a #
# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())
# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")
# Define functions
# ------------------------------------------------------------------------------
foo <- function(x){
  return("bar")
}
# Load data
# ------------------------------------------------------------------------------
my_data <- read_tsv(file = "path/to/my/data.tsv")
# Wrangle data
# ------------------------------------------------------------------------------
my_data_subset <- my_data %>%
  filter(...) %>%
  select(...) %>%
  mutate(...) %>%
  arrange(...)
# Visualise
# ------------------------------------------------------------------------------
pl1 <- my_data_subset %>%
  ggplot(aes(x = var_1, y = var_2)) +
  geom_point() +
  theme_bw()
# Write data
# ------------------------------------------------------------------------------
ggsave(filename = "path/to/my/results/plot.png",
       plot = pl1,
       width = 10,
       height = 6)
# Note: readr >= 1.4.0 uses the argument file; path is deprecated
write_tsv(x = my_data_subset,
          file = "path/to/my/data_subset.tsv")
With respect to the formatting of scripts, it is highly recommended to follow the tidyverse style guide.
(the script files are .R and not .Rmd)
project_root
  |
  + data
  |   |
  |   + _raw
  |
  + doc
  |
  + R
  |   + 00_doit.R
  |   + 01_load.R
  |   + 02_clean.R
  |   + 03_augment.R
  |   + 04_analysis_i.R
  |   + 99_project_functions.R
  |
  + results
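The empty script files in this layout could be created from the terminal before filling them in. This is a sketch, assuming a POSIX shell started in the project root; the file names are the ones listed in the tree:

```shell
# Create the R directory if needed, then one empty file per script.
mkdir -p R
for script in 00_doit 01_load 02_clean 03_augment 04_analysis_i 99_project_functions; do
  touch "R/${script}.R"
done

# Show the result
ls R
```

The numeric prefixes keep the files sorted in the order in which they are meant to run.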
# R/00_doit.R
# Run all scripts
# ------------------------------------------------------------------------------
source(file = "R/01_load.R")
source(file = "R/02_clean.R")
source(file = "R/03_augment.R")
source(file = "R/04_analysis_i.R")
# R/01_load.R
# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())
# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")
# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")
# Load data
# ------------------------------------------------------------------------------
my_data_raw <- read_tsv(file = "data/_raw/my_raw_data.tsv")
# Wrangle data
# ------------------------------------------------------------------------------
my_data <- my_data_raw # %>% ...
# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data,
          file = "data/01_my_data.tsv")
# R/02_clean.R
# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())
# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")
# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")
# Load data
# ------------------------------------------------------------------------------
my_data <- read_tsv(file = "data/01_my_data.tsv")
# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean <- my_data # %>% ...
# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data_clean,
          file = "data/02_my_data_clean.tsv")
# R/03_augment.R
# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())
# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")
# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")
# Load data
# ------------------------------------------------------------------------------
my_data_clean <- read_tsv(file = "data/02_my_data_clean.tsv")
# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean_aug <- my_data_clean # %>% ...
# Write data
# ------------------------------------------------------------------------------
write_tsv(x = my_data_clean_aug,
          file = "data/03_my_data_clean_aug.tsv")
# R/04_analysis_i.R
# Clear workspace
# ------------------------------------------------------------------------------
rm(list = ls())
# Load libraries
# ------------------------------------------------------------------------------
library("tidyverse")
# Define functions
# ------------------------------------------------------------------------------
source(file = "R/99_project_functions.R")
# Load data
# ------------------------------------------------------------------------------
my_data_clean_aug <- read_tsv(file = "data/03_my_data_clean_aug.tsv")
# Wrangle data
# ------------------------------------------------------------------------------
my_data_clean_aug %>% ...
# Model data
# ------------------------------------------------------------------------------
my_data_clean_aug %>% ...
# Visualise data
# ------------------------------------------------------------------------------
my_data_clean_aug %>% ...
# Write data
# ------------------------------------------------------------------------------
write_tsv(...)
ggsave(...)
# R/99_project_functions.R
# Define project functions
# ------------------------------------------------------------------------------
foo <- function(x){
  return(2*x)
}
bar <- function(x){
  return(x^2)
}
...
Now, this is as generalised as possible. The next step is to fill in some data in "data/_raw/" and then commence with setting up your analysis.
When the project has been set up correctly, you should be able to delete everything except the raw data and your script files (see the visualisation at the start of these exercises) and then run the entire project by opening 00_doit.R and sourcing it with cmd/ctrl+shift+s. You can also run it from the command line, which allows automation and integration into an HPC environment: simply switch to the Terminal tab in the console pane and issue the command Rscript R/00_doit.R. Remember, in an HPC environment you may need to know the absolute location of Rscript; you can get this by running which Rscript in the terminal.
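The command-line run described above can be wrapped in a few lines of shell. This is only a sketch, assuming a POSIX shell started in the project root; the variable names rscript_path and run_status are made up for illustration:

```shell
# Look up the absolute path of Rscript. `command -v` is the POSIX
# counterpart of `which`; `|| true` keeps going if nothing is found.
rscript_path="$(command -v Rscript || true)"

if [ -z "$rscript_path" ]; then
  run_status="Rscript not found on PATH"
elif [ ! -f R/00_doit.R ]; then
  run_status="R/00_doit.R not found, start from the project root"
elif "$rscript_path" R/00_doit.R; then
  # The whole project was run via the absolute path of Rscript
  run_status="project ran without errors"
else
  run_status="project run failed"
fi

echo "$run_status"
```

On an HPC system, the absolute path found here is what you would typically put in a job script.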
Make sure you can run 00_doit.R, i.e. your entire project, without errors, then delete and re-run as described above.
The following will be quite cookbook-like, but stay tuned! I can guarantee you that if you get into industry as a data scientist, you will encounter git!
If you go to this GitHub site, you will find the official R-for-Bio-Data-Science GitHub organisation. Before we continue, if you do not have a GitHub account, go create one. Then send me your GitHub username in a personal Slack message. I will then invite you to the organisation.
Click the New button in the upper right corner
Name the repository 2020_your_github_username (make sure that the owner of the repo is rforbiodatascience and not your own GitHub profile)
Click Create repository at the bottom of the page
Terminal pane: this will present you with a full shell
Terminal pane: be sure to exchange my example credentials for your own
rforbiodatascience / 2020_username: click on 2020_username
README.md and the header 2020_username
Relaunch Project
OK
Environment pane: you should now see a new tab, Git, and in your Files pane you should see a new file appear: .gitignore. We will get back to both of these later
# Codebox for setting up git:
git init
git config user.email "your_github_email_address"
git config user.name "your_github_user_name"
echo "# 2020_your_github_user_name" > README.md
git add README.md
git commit -m "first commit"
git remote add origin https://github.com/rforbiodatascience/2020_your_github_user_name.git
git push -u origin master
Username for 'https://github.com': your_github_user_name # You will be prompted here
Password for 'https://your_github_user_name@github.com': # Use your own user/pass
If you get the error "error: cannot run rpostback-askpass: No such file or directory", ignore it, and make sure that your email, username and password match your GitHub credentials.
git
R Markdown...
Yes
New R Markdown: fill in the title Reproducible Data Analysis Framework Exercise and author, and click OK
Change output: html_document to output: github_document
README.Rmd
## R Markdown: replace the auto-header ## R Markdown with ## Description
Under ## Description, add a few lines to describe what is in the repository
Environment pane: here, you should have a new tab you have not seen before, Git. Go ahead and click it
Commit, just above the list of files: click it
README.md: it should have a blue M to the left of it, and further to the left, under Staged, there is a tick-box. Tick it
Under Commit message, write e.g. Update README and click the Commit button just below the box
Git Commit messages: just click Close
pull: this should result in a Git Pull message saying Already up-to-date. Click Close once again (pull = pull stuff down from the GitHub server)
push, and enter your username and password. This should result in a Git Push message saying something along the lines of HEAD -> master. Click Close once again (push = push stuff up to the GitHub server)
README you created
.gitignore: click it and add the lines .gitignore and project.Rproj. When you save this file, you should see the files corresponding to the lines you entered disappear from the staging area (this is how you control which files end up in your GitHub repo)
(git add .)
first full project commit, and then repeat the aforementioned pull/push
pull/push, and again check if you can see your change in your GitHub repo.
Warning: Git merge conflicts can get complicated. One remedy is to always pull before you push.
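The stage, commit, pull and push cycle from the steps above can also be run from the Terminal pane. The sketch below is not the course setup. It uses a local bare repository as a stand-in for the GitHub server (an assumption, so that no network access or credentials are needed), and the names demo_dir and branch are made up for illustration:

```shell
# Stand-in for the GitHub server: a local bare repository
demo_dir="$(mktemp -d)"
git init --quiet --bare "$demo_dir/remote.git"

# The project repository, configured as in the codebox earlier
git init --quiet "$demo_dir/project"
cd "$demo_dir/project"
git config user.email "your_github_email_address"
git config user.name "your_github_user_name"
git remote add origin "$demo_dir/remote.git"

# Project files. The two names listed in .gitignore are never staged.
echo "# 2020_your_github_user_name" > README.md
touch project.Rproj
printf '.gitignore\nproject.Rproj\n' > .gitignore

# Stage everything that is not ignored, commit, and push.
# The first push creates the branch on the remote.
git add .
git commit --quiet -m "first full project commit"
branch="$(git rev-parse --abbrev-ref HEAD)"
git push --quiet -u origin "$branch"

# A later change: commit, then pull before you push
echo "## Description" >> README.md
git add README.md
git commit --quiet -m "Update README"
git pull --quiet origin "$branch"
git push --quiet origin "$branch"

# List the tracked files
git ls-files
```

With a real GitHub remote, the push steps would prompt for your username and password as described earlier.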
That’s it for today!