Getting Started

First of all, make sure to read every line in these exercises carefully!

For these exercises, work in pairs within your desk groups. If you get stuck, revisit the chapters you prepared for today or practise your googling skills. I will of course also be circulating and available for questions.

First, go to your RStudio Cloud session from last time, log in and choose the project you created.

Recall the layout of the IDE (Integrated Development Environment)

Then, before we start, if you did not install the tidyverse package last time (or for some reason created a new project), run the following line of code in your console:

install.packages("tidyverse")

Once the installation has finished (it takes about 5 minutes), create a new R Markdown document (File -> New File -> R Markdown…)

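Recall the syntax for a new code chunk: it opens with a line of three backticks followed by {r} and closes with a line of three backticks. All your R code goes inside the chunk; any text and notes must be outside the chunk tags: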

# Here goes the code... Note how this part does not get executed

Now, in your new R Markdown document, add a new code chunk with the command

library("tidyverse")

and run the chunk. This will load our data science toolbox, including dplyr and ggplot2.

Here are a few handy shortcuts:

Insert new code chunk
  • Mac: CMD + OPTION + i
  • Win: CTRL + ALT + i

Knit my rmarkdown document
  • Mac: CMD + SHIFT + k
  • Win: CTRL + SHIFT + k

Run line in chunk
  • Mac: CMD + ENTER
  • Win: CTRL + ENTER

Run entire chunk
  • Mac: CMD + SHIFT + ENTER
  • Win: CTRL + SHIFT + ENTER

Insert the pipe symbol %>%
  • Mac: CMD + SHIFT + m
  • Win: CTRL + SHIFT + m
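As a reminder of what the pipe symbol does, here is a minimal sketch. The vector x is made up for illustration. The idea is that x %>% f(y) is just another way of writing f(x, y), so a chain of steps can be read from left to right instead of from the inside out.

```r
library(dplyr)  # loading dplyr also makes the pipe %>% available

x <- c(4, 3, 5, 1, 2)  # made-up example values

# Nested calls: read from the inside out
nested <- round(sqrt(sum(x)), 1)

# The same computation with the pipe: read from left to right
piped <- x %>% sum() %>% sqrt() %>% round(1)

# Both ways give the same result
identical(nested, piped)
```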

A few initial questions

In your desk group, discuss the following primer questions. Note: when I ask “what is the output”, do not run the code in the console. Instead, try to talk and think about it, and write your answers and notes in your rmarkdown document for the day:

What is the output of:

  • Q1: tibble(x = c(4, 3, 5, 1, 2)) %>% filter(x > 2)

  • Q2: tibble(x = c(4, 3, 5, 1, 2)) %>% arrange(x)

  • Q3: tibble(x = c(4, 3, 5, 1, 2)) %>% arrange(desc(x))

  • Q4: tibble(x = c(4, 3, 5, 1, 2)) %>% arrange(desc(desc(x)))

  • Q5: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(x)

  • Q6: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(y)

  • Q7: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(-x)

  • Q8: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(-x, -y)

  • Q9: tibble(x = c(4, 3, 5, 1, 2)) %>% mutate(x_dbl = 2*x)

  • Q10: tibble(x = c(4, 3, 5, 1, 2)) %>% mutate(x_dbl = 2 * x, x_qdr = 2*x_dbl)

  • Q11: tibble(x = c(4, 3, 5, 1, 2)) %>% summarise(x_mu = mean(x))

  • Q12: tibble(x = c(4, 3, 5, 1, 2)) %>% summarise(x_max = max(x))

  • Q13: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, NA, 5, 1, 2)) %>% group_by(lbl) %>% summarise(x_mu = mean(x), x_max = max(x))

  • Q14: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) %>% group_by(lbl) %>% summarise(n = n())

  • Q15: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) %>% count(lbl)

In what follows, return to these questions and your answers as a reference for the dplyr verbs.

Load data

  1. Go to the Vanderbilt Biostatistics Data Sets site
  2. Find Diabetes data and download the diabetes.csv file
  3. Go to your project
  4. In the Files pane, click the New Folder button, enter folder name data and click ok
  5. Now, click on the folder you created
  6. Click the Upload button and navigate to the diabetes.csv file you downloaded in step 2
  7. Clicking the two dots .. above the file you uploaded will take you back to your project root
  8. Insert a new code chunk in your rmarkdown document
  9. Add and run the following code
diabetes_data <- read_csv(file = "data/diabetes.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   chol = col_double(),
##   stab.glu = col_double(),
##   hdl = col_double(),
##   ratio = col_double(),
##   glyhb = col_double(),
##   location = col_character(),
##   age = col_double(),
##   gender = col_character(),
##   height = col_double(),
##   weight = col_double(),
##   frame = col_character(),
##   bp.1s = col_double(),
##   bp.1d = col_double(),
##   bp.2s = col_double(),
##   bp.2d = col_double(),
##   waist = col_double(),
##   hip = col_double(),
##   time.ppn = col_double()
## )
## # A tibble: 403 x 19
##       id  chol stab.glu   hdl ratio glyhb location   age gender height
##    <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>    <dbl> <chr>   <dbl>
##  1  1000   203       82    56  3.60  4.31 Bucking…    46 female     62
##  2  1001   165       97    24  6.90  4.44 Bucking…    29 female     64
##  3  1002   228       92    37  6.20  4.64 Bucking…    58 female     61
##  4  1003    78       93    12  6.5   4.63 Bucking…    67 male       67
##  5  1005   249       90    28  8.90  7.72 Bucking…    64 male       68
##  6  1008   248       94    69  3.60  4.81 Bucking…    34 male       71
##  7  1011   195       92    41  4.80  4.84 Bucking…    30 male       69
##  8  1015   227       75    44  5.20  3.94 Bucking…    37 male       59
##  9  1016   177       87    49  3.60  4.84 Bucking…    45 male       69
## 10  1022   263       89    40  6.60  5.78 Bucking…    55 female     63
## # … with 393 more rows, and 9 more variables: weight <dbl>, frame <chr>,
## #   bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>, bp.2d <dbl>, waist <dbl>,
## #   hip <dbl>, time.ppn <dbl>

Work with the diabetes data set

Use the View() function to inspect the data set

  • Q1: How many observations and how many variables?
  • Q2: Is this a tidy data set? Which three rules must be satisfied?
  • Q3: When you run the chunk, <chr> or <dbl> is stated underneath each column name. What does that mean?

Before we continue

  • T1: Change the height, weight, waist and hip from inches/pounds to the metric system (cm/kg), rounding to 1 decimal
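As a hint for this kind of unit conversion, here is a minimal sketch on made-up data. The tibble toy and its column names are hypothetical and are not the columns of the diabetes data. The conversion factors are 1 inch = 2.54 cm and 1 pound = 0.45359237 kg.

```r
library(dplyr)

# Made-up measurements; the column names are hypothetical
toy <- tibble(len_in = c(60, 72), mass_lb = c(100, 220))

toy_metric <- toy %>%
  mutate(len_cm  = round(len_in * 2.54, 1),         # 1 inch = 2.54 cm
         mass_kg = round(mass_lb * 0.45359237, 1))  # 1 pound = 0.45359237 kg

toy_metric
```

Here the converted values go into new columns. Assigning to an existing column name inside mutate() would overwrite that column instead.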

Let us take a closer look at the data by subsetting it in various ways (the answer to each “How many…” question is the number of rows in the subset of the data you create):

  • Q4: How many weigh less than 100kg?
  • Q5: How many weigh more than 100kg?
  • Q6: How many weigh more than 100kg and are less than 1.6m tall?
  • Q7: How many women are taller than 1.8m?
  • Q8: How many men are taller than 1.8m?
  • Q9: How many women in Louisa are older than 30?
  • Q10: How many men in Buckingham are younger than 30 and taller than 1.9m?
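Counting questions like these can be approached by filtering and then counting rows. Below is a small sketch on made-up data. The tibble toy and its columns grp and val are hypothetical.

```r
library(dplyr)

# Made-up data; the column names are hypothetical
toy <- tibble(grp = c("a", "a", "b", "b", "b"),
              val = c(10, 25, 30, 5, 40))

# Conditions separated by commas in filter() are combined with AND
n_hits <- toy %>%
  filter(grp == "b", val > 20) %>%
  nrow()

n_hits
```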

  • T2: Make a scatter plot of weight versus height and colour by gender for inhabitants of Louisa above the age of 40
  • T3: Make a boxplot of height versus location stratified on gender for people above the age of 50
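If you need a reminder of the plotting syntax, here is a rough sketch using the mpg data set that ships with ggplot2, not the diabetes data. The choice of variables is arbitrary and only meant to show the pattern: filter first, then map variables to aesthetics.

```r
library(dplyr)
library(ggplot2)

# Scatter plot of a filtered subset, coloured by a categorical variable
p_scatter <- mpg %>%
  filter(year == 2008) %>%
  ggplot(aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# Boxplot of a numeric variable per category, split by a second category
p_box <- mpg %>%
  filter(cyl > 4) %>%
  ggplot(aes(x = class, y = hwy, fill = drv)) +
  geom_boxplot()

p_scatter
p_box
```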

Sorting columns can aid in getting an overview of variable ranges (don’t use the summary function yet for this one)

  • Q11: How old is the youngest person?
  • Q12: How old is the oldest person?
  • Q13: Of all the 20-year olds, what is the height of the tallest?
  • Q14: Of all the 20-year olds, what is the height of the shortest?

Choosing specific columns can be used to work with a subset of the data for a specific purpose

  • Q15: How many columns (variables) starts_with a “b”?
  • Q16: How many columns (variables) contains the word “eight”?
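The dplyr selection helpers starts_with() and contains() can be used inside select(). A minimal sketch on a made-up tibble (the column names are hypothetical):

```r
library(dplyr)

# Made-up one-row tibble; only the column names matter here
toy <- tibble(alpha_1 = 1, alpha_2 = 2, beta_1 = 3, gamma_alpha = 4)

# Columns whose name begins with "alpha"
n_starts <- toy %>% select(starts_with("alpha")) %>% ncol()

# Columns with "alpha" anywhere in the name
n_contains <- toy %>% select(contains("alpha")) %>% ncol()

c(n_starts, n_contains)
```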

Creating new variables is an integral part of data manipulation
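As a small illustration, BMI is defined as body mass in kilograms divided by the square of the height in metres. The sketch below applies that formula to made-up data. The tibble toy and its column names are hypothetical, so you will need to adapt it to the diabetes data yourself.

```r
library(dplyr)

# Made-up data; the column names are hypothetical
toy <- tibble(mass_kg = c(60, 90), len_cm = c(170, 180))

toy_bmi <- toy %>%
  mutate(len_m = len_cm / 100,     # the formula expects height in metres
         bmi = mass_kg / len_m^2)  # BMI = kg / m^2

toy_bmi
```

Note that a variable created in mutate() (here len_m) can be used right away in the same call.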

  • T4: Create a new variable, where you calculate the BMI

  • T5: Create a BMI_class variable

This is tricky: go read about BMI classification here, then take a look at the following code snippet to get you started:

tibble(x = rnorm(100)) %>% 
  mutate(trichotomised = case_when(x < -1 ~ "Less than -1",
                                   -1 <= x & x < 1 ~ "larger than or equal to -1 and smaller than 1",
                                   1 <= x ~ "Larger than or equal to 1"))
## # A tibble: 100 x 2
##          x trichotomised                                
##      <dbl> <chr>                                        
##  1  0.663  larger than or equal to -1 and smaller than 1
##  2 -0.833  larger than or equal to -1 and smaller than 1
##  3  1.22   Larger than or equal to 1                    
##  4 -0.202  larger than or equal to -1 and smaller than 1
##  5 -0.971  larger than or equal to -1 and smaller than 1
##  6 -0.0565 larger than or equal to -1 and smaller than 1
##  7 -0.132  larger than or equal to -1 and smaller than 1
##  8  0.405  larger than or equal to -1 and smaller than 1
##  9  0.0246 larger than or equal to -1 and smaller than 1
## 10  2.66   Larger than or equal to 1                    
## # … with 90 more rows

Use the following class labels:

  • underweight
  • normal weight
  • overweight
  • obese
  • severe obesity
  • morbid obesity
  • super obese

Once you have created the variable, you will need to convert it to a categorical variable. In R, these are called factors, and you can set the levels like so:

diabetes_data <- diabetes_data %>%
  mutate(BMI_class = factor(BMI_class,
                            levels =  c("underweight", "normal weight", "overweight", "obese",
                                        "severe obesity", "morbid obesity", "super obese")))

This is very important for plotting, as this will determine the order in which the categories appear on the plot!
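To see why the levels matter, here is a minimal sketch with made-up labels (the vector sizes is hypothetical). Without explicit levels, factor() orders the levels alphabetically, which is rarely the order you want on a plot axis.

```r
sizes <- c("small", "large", "medium", "small")  # made-up labels

# Without explicit levels, the levels are sorted alphabetically
default_levels <- levels(factor(sizes))

# With explicit levels, the order is the one you give
ordered_levels <- levels(factor(sizes, levels = c("small", "medium", "large")))

default_levels
ordered_levels
```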

  • T6: Create a boxplot of hdl versus BMI_class

  • Q17: What do you see?

  • T7: Create a BFP (Body fat percentage) variable (Hint: Revisit this link and think about if it would make sense to create a new representation of gender)

  • T8: Create a WHR (waist-to-hip ratio) variable
  • Q18: Which correlates better with BMI, WHR or BFP? (Hint: Create two appropriate separate scatter plots and add a statistical smoother using geom_smooth(method = "lm"))
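The hint above can be sketched as follows, using the built-in mtcars data set rather than the diabetes data. Adding cor() as a numeric companion to the plot is a suggestion of this sketch and not something the exercise requires.

```r
library(ggplot2)

# Scatter plot with a fitted straight line (method = "lm") and its confidence band
p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

# Pearson correlation between the same two variables;
# use = "complete.obs" drops rows with missing values
r <- cor(mtcars$wt, mtcars$mpg, use = "complete.obs")

p_smooth
r
```

The tighter the points lie around the fitted line, the closer the correlation is to -1 or 1.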

Now, with this augmented data set, let us create some summary statistics

  • Q19: How many women and men are there in the data set?
  • Q20: How many women and men are there from Buckingham and Louisa respectively in the data set?
  • Q21: How many are in each of the BMI_class groups?

  • Q22: Given the code below, explain the difference between A and B?

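# A
diabetes_data %>%
  ggplot(aes(x = BMI_class)) +
  geom_bar()  # assumed completion: the original line after "+" is missing

# B
diabetes_data %>%
  count(BMI_class) %>%
  ggplot(aes(x = BMI_class, y = n)) +
  geom_col()  # assumed completion: the original line after "+" is missing

  • T9: For each BMI_class group, calculate the average weight and associated standard deviation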

  • Q23: What was the average age of the women living in Buckingham in the study?
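Grouped summaries like these follow the group_by() and summarise() pattern from the primer questions. Below is a minimal sketch on made-up data. The tibble toy and its columns are hypothetical. It also shows one way to handle missing values, which otherwise propagate into the summary.

```r
library(dplyr)

# Made-up data with one missing value; the column names are hypothetical
toy <- tibble(grp = c("a", "a", "b", "b", "b"),
              val = c(1, 3, 2, NA, 6))

toy_summary <- toy %>%
  group_by(grp) %>%
  summarise(n = n(),                           # rows per group, NA rows included
            val_mu = mean(val, na.rm = TRUE),  # na.rm = TRUE ignores missing values
            val_sd = sd(val, na.rm = TRUE))

toy_summary
```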

Finally, if you reach this point and there is still time left, take some time to make some exploratory plots of the data set and see if you can find something interesting.