First of all, make sure to read every line in these exercises carefully!
For these exercises, work 2-and-2 in your desk groups. If you get stuck, go visit the chapters you prepared for today or try to practise your google’ing skills. I will of course also be circulating and available for questions.
First, go to your RStudio Cloud session from last time, log in and choose the project you created.
Recall the layout of the IDE (Integrated Development Environment)
Then, before we start, if you did not install the tidyverse package last time (or for some reason created a new project), run the following line of code in your console:
install.packages("tidyverse")
Once the installation has finished (it takes ~5 minutes), create a new rmarkdown document (File -> New File -> R Markdown…)
Recall the syntax for a new code chunk, where all your R code goes. Any text and notes must be outside the chunk tags:
```{r}
# Here goes the code... Note how this part does not get executed
```
Now, in your new rmarkdown document, add a new code chunk with the command
library("tidyverse")
Then run the chunk. This will load our data science toolbox, including dplyr (and ggplot2)
Insert new code chunk: - Mac: CMD + OPTION + i - Windows: CTRL + ALT + i
Knit my rmarkdown document - Mac: CMD + SHIFT + k - Win: CTRL + SHIFT + k
Run line in chunk - Mac: CMD + ENTER - Win: CTRL + ENTER
Run entire chunk - Mac: CMD + SHIFT + ENTER - Win: CTRL + SHIFT + ENTER
Insert the pipe symbol %>% - Mac: CMD + SHIFT + m - Win: CTRL + SHIFT + m
In your desk group, discuss the following primer questions. Note, when I ask “what is the output”, do not run the code in the console. Instead, try to talk and think about it and write your answers and notes in your rmarkdown document for the day:
What is the output of:
Q1: tibble(x = c(4, 3, 5, 1, 2)) %>% filter(x > 2)
Q2: tibble(x = c(4, 3, 5, 1, 2)) %>% arrange(x)
Q3: tibble(x = c(4, 3, 5, 1, 2)) %>% arrange(desc(x))
Q4: tibble(x = c(4, 3, 5, 1, 2)) %>% arrange(desc(desc(x)))
Q5: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(x)
Q6: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(y)
Q7: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(-x)
Q8: tibble(x = c(4, 3, 5, 1, 2), y = c(2, 4, 3, 5, 1)) %>% select(-x, -y)
Q9: tibble(x = c(4, 3, 5, 1, 2)) %>% mutate(x_dbl = 2*x)
Q10: tibble(x = c(4, 3, 5, 1, 2)) %>% mutate(x_dbl = 2 * x, x_qdr = 2*x_dbl)
Q11: tibble(x = c(4, 3, 5, 1, 2)) %>% summarise(x_mu = mean(x))
Q12: tibble(x = c(4, 3, 5, 1, 2)) %>% summarise(x_max = max(x))
Q13: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, NA, 5, 1, 2)) %>% group_by(lbl) %>% summarise(x_mu = mean(x), x_max = max(x))
Q14: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) %>% group_by(lbl) %>% summarise(n = n())
Q15: tibble(lbl = c("A", "A", "B", "B", "C"), x = c(4, 3, 5, 1, 2)) %>% count(lbl)
In the following, return to these questions and your answers for reference on the dplyr verbs
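If you want a quick reminder of the verb syntax without spoiling the primer questions, here is a minimal sketch on a made-up tibble. The object and column names (toy, grp, val) are my own invention for illustration and have nothing to do with the data we use later.

```r
library("tidyverse")

# Made-up toy data (deliberately not the primer tibbles)
toy <- tibble(grp = c("a", "a", "b", "b"),
              val = c(10, 30, 20, 40))

toy_filtered <- toy %>% filter(val > 15)            # keep rows where val > 15
toy_sorted   <- toy %>% arrange(desc(val))          # sort rows, largest val first
toy_selected <- toy %>% select(val)                 # keep only the val column
toy_mutated  <- toy %>% mutate(val_half = val / 2)  # add a new column
toy_summary  <- toy %>% summarise(val_mu = mean(val))  # collapse to one row
toy_counts   <- toy %>% count(grp)                  # number of rows per grp
```

Print each object in turn and compare with what you expected before moving on.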
… diabetes.csv file
In the Files pane, click the New Folder button, enter folder name data and click ok
Click the Upload button and navigate to the diabetes.csv file you downloaded in step 2.
Clicking .. above the file you uploaded will take you back to your project root
diabetes_data <- read_csv(file = "data/diabetes.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## chol = col_double(),
## stab.glu = col_double(),
## hdl = col_double(),
## ratio = col_double(),
## glyhb = col_double(),
## location = col_character(),
## age = col_double(),
## gender = col_character(),
## height = col_double(),
## weight = col_double(),
## frame = col_character(),
## bp.1s = col_double(),
## bp.1d = col_double(),
## bp.2s = col_double(),
## bp.2d = col_double(),
## waist = col_double(),
## hip = col_double(),
## time.ppn = col_double()
## )
diabetes_data
## # A tibble: 403 x 19
## id chol stab.glu hdl ratio glyhb location age gender height
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1000 203 82 56 3.60 4.31 Bucking… 46 female 62
## 2 1001 165 97 24 6.90 4.44 Bucking… 29 female 64
## 3 1002 228 92 37 6.20 4.64 Bucking… 58 female 61
## 4 1003 78 93 12 6.5 4.63 Bucking… 67 male 67
## 5 1005 249 90 28 8.90 7.72 Bucking… 64 male 68
## 6 1008 248 94 69 3.60 4.81 Bucking… 34 male 71
## 7 1011 195 92 41 4.80 4.84 Bucking… 30 male 69
## 8 1015 227 75 44 5.20 3.94 Bucking… 37 male 59
## 9 1016 177 87 49 3.60 4.84 Bucking… 45 male 69
## 10 1022 263 89 40 6.60 5.78 Bucking… 55 female 63
## # … with 393 more rows, and 9 more variables: weight <dbl>, frame <chr>,
## # bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>, bp.2d <dbl>, waist <dbl>,
## # hip <dbl>, time.ppn <dbl>
Use the View() function to inspect the data set
The printout above shows <chr> and <dbl> under the column names, what is that?
Before we continue
Convert height, weight, waist and hip from inches/pounds to the metric system (cm/kg), rounding to 1 decimal
Let us try to take a closer look at the data by various subsetting (How many… is equal to the number of rows in the subset of the data you created):
Q10: How many men in Buckingham are younger than 30 and taller than 1.9m?
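The “How many…” questions above all follow the same pattern: filter the rows, then count what is left. A minimal sketch, assuming a made-up tibble (people and all its columns and values are hypothetical, not the diabetes data):

```r
library("tidyverse")

# Made-up data; names and values are hypothetical
people <- tibble(town      = c("X", "X", "Y", "Y", "X"),
                 sex       = c("male", "female", "male", "male", "male"),
                 age       = c(25, 31, 28, 45, 52),
                 height_cm = c(181, 165, 176, 190, 172))

# Conditions separated by commas in filter() are combined with AND
n_match <- people %>%
  filter(town == "X", sex == "male", age > 30) %>%
  nrow()
n_match
```

Mind the units when you write your own conditions: once heights are in cm, a limit stated in metres has to be expressed in cm as well.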
Plot weight versus height and colour by gender for inhabitants of Louisa above the age of 40
T3: Make a boxplot of height versus location stratified on gender for people above the age of 50
Sorting columns can aid in getting an overview of variable ranges (don’t use the summary function yet for this one)
Choosing specific columns can be used to work with a subset of the data for a specific purpose
… starts_with a “b”?
… contains the word “eight”?
Creating new variables is an integral part of data manipulation
T4: Create a new variable, where you calculate the BMI
T5: Create a BMI_class variable
This is tricky. Go read about BMI classification here, then take a look at the following code snippet to get you started:
tibble(x = rnorm(100)) %>%
  mutate(trichotomised = case_when(x < -1 ~ "Less than -1",
                                   -1 <= x & x < 1 ~ "larger than or equal to -1 and smaller than 1",
                                   1 <= x ~ "Larger than or equal to 1"))
## # A tibble: 100 x 2
## x trichotomised
## <dbl> <chr>
## 1 0.663 larger than or equal to -1 and smaller than 1
## 2 -0.833 larger than or equal to -1 and smaller than 1
## 3 1.22 Larger than or equal to 1
## 4 -0.202 larger than or equal to -1 and smaller than 1
## 5 -0.971 larger than or equal to -1 and smaller than 1
## 6 -0.0565 larger than or equal to -1 and smaller than 1
## 7 -0.132 larger than or equal to -1 and smaller than 1
## 8 0.405 larger than or equal to -1 and smaller than 1
## 9 0.0246 larger than or equal to -1 and smaller than 1
## 10 2.66 Larger than or equal to 1
## # … with 90 more rows
Use the following class labels:
Once you have created the variable, you will need to convert it to a categorical variable. In R, this is called a factor, and you can set the levels like so:
diabetes_data <- diabetes_data %>%
  mutate(BMI_class = factor(BMI_class,
                            levels = c("underweight", "normal weight", "overweight", "obese",
                                       "severe obesity", "morbid obesity", "super obese")))
This is very important for plotting, as this will determine the order in which the categories appear on the plot!
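To see why the level order matters, here is a small sketch on a made-up character vector (sizes and its labels are hypothetical). Without explicit levels, R sorts the levels alphabetically, which is rarely the order you want on an axis.

```r
# Made-up labels, purely for illustration
sizes <- c("medium", "small", "large", "small")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(sizes))

# Explicit levels: the order follows the vector you supply
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"))
ordered_levels <- levels(sizes_fct)

# A value that is not among the supplied levels silently becomes NA
typo_fct <- factor(c("small", "tiny"), levels = c("small", "medium", "large"))

default_levels
ordered_levels
typo_fct
```

The last part is worth remembering: if a label in your variable is spelled differently from the one in levels, you get NA instead of an error.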
T6: Create a boxplot of hdl versus BMI_class
Q17: What do you see?
T7: Create a BFP (Body fat percentage) variable (Hint: Revisit this link and think about if it would make sense to create a new representation of gender)
T8: Create a WHR (waist-to-hip ratio) variable
Q18: Which correlates better with BMI, WHR or BFP? (Hint: Create two appropriate separate scatter plots and add a statistical smoother using geom_smooth(method = "lm"))
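As a starting point for the hint above, here is a sketch of a scatter plot with a linear smoother on made-up data. The tibble toy and its columns u and v are hypothetical. The call to cor() is my own addition as one possible way to put a number on "correlates better"; the exercise itself only asks for the plots.

```r
library("tidyverse")

set.seed(1)  # make the made-up data reproducible
toy <- tibble(u = rnorm(50),
              v = 2 * u + rnorm(50))

p <- toy %>%
  ggplot(aes(x = u, y = v)) +
  geom_point() +
  geom_smooth(method = "lm")  # straight-line fit with a confidence band
p

# Pearson correlation between the two made-up variables
r_uv <- cor(toy$u, toy$v)
r_uv
```

Make one such plot per pair of variables and compare how tightly the points follow the line.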
Now, with this augmented data set, let us create some summary statistics
Q21: How many are in each of the BMI_class groups?
Q22: Given the code below, explain the difference between A and B?
# A
diabetes_data %>%
  ggplot(aes(x = BMI_class)) +
  geom_bar()
# B
diabetes_data %>%
  count(BMI_class) %>%
  ggplot(aes(x = BMI_class, y = n)) +
  geom_col()
T9: For each BMI_class group, calculate the average weight and associated standard deviation
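Grouped summaries like this one can trip over missing values, so here is a minimal sketch on made-up data (toy, grp and val are hypothetical names). It assumes you want to ignore missing values, which is a choice you should make consciously for the real data.

```r
library("tidyverse")

# Made-up data with one missing value in group "b"
toy <- tibble(grp = c("a", "a", "a", "b", "b"),
              val = c(1, 2, 3, 10, NA))

toy_stats <- toy %>%
  group_by(grp) %>%
  summarise(val_mu = mean(val, na.rm = TRUE),  # mean, ignoring NA
            val_sd = sd(val, na.rm = TRUE))    # standard deviation, ignoring NA
toy_stats
```

Note that group "b" has only one non-missing value left, so its standard deviation is NA. Try removing na.rm = TRUE and see what happens to the mean of that group.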
Q23: What was the average age of the women living in Buckingham in the study?
Finally, if you reach this point and there is still time left, take some time to do some exploratory plots of the data set and see if you can find something interesting.