Data Wrangling with dplyr

A Presentation for Weill Cornell Medicine’s Biostatistics Computing Club

Image courtesy of Allison Horst’s Twitter: @allison_horst

Introduction

Why dplyr?

  • Powerful but efficient

    • Consistent syntax

    • Fast

  • Function chaining

    • Works well with entire tidyverse suite
  1. Efficiency*

  2. Simple syntax

  3. Function chaining

  4. Ability to analyze external databases

  5. Works well with other packages in tidyverse suite
    1. ggplot2
    2. tidyr
    3. stringr
    4. forcats
    5. purrr

*if you start dealing with data sets with > 1 million rows, data.table will be much faster

data("iris")
library(tidyverse)
## ── Attaching packages ── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Tibbles

  • An “update” to the data.frame object class in R

  • Updates relevant for using dplyr:

    • Vectors of length 1 are automatically recycled

    • Newly created vectors can be referenced in the same line of code

  • Other perks: only the first 10 lines print to the screen so your console doesn’t get overloaded

  • Read more about tibbles here: https://r4ds.had.co.nz/tibbles.html

  • dplyr automatically converts objects to tbl_df (tibble data frame) objects

Piping

  • %>% operator from library(magitrr)

  • Use %>% to send an object (typically a dataframe) to the next function

  • The function you pipe to will use the object in front of the %>% as its first argument

iris %>% head() # equivalent to head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris %>% head(n=3) # equivalent to head(iris, n=3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
  • If you don’t want it to be the first argument, use .
iris %>% lm(Sepal.Width ~ Sepal.Length, data=.)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = .)
## 
## Coefficients:
##  (Intercept)  Sepal.Length  
##      3.41895      -0.06188
  • Shortcut for %>% is CTRL + SHIFT + M (or CMD + SHIFT + M for OSX)

  • A function that takes a data frame as the first argument, eg. head(), is called a verb

  • The entire tidyverse suite operates under the verb function structure, making piping especially convenient

Main verbs

arrange()

  • Sort data frame by column(s), lowest to highest
arrange(iris, Petal.Length) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           4.6         3.6          1.0         0.2  setosa
## 2           4.3         3.0          1.1         0.1  setosa
## 3           5.8         4.0          1.2         0.2  setosa
## 4           5.0         3.2          1.2         0.2  setosa
## 5           4.7         3.2          1.3         0.2  setosa
## 6           5.4         3.9          1.3         0.4  setosa
## 7           5.5         3.5          1.3         0.2  setosa
## 8           4.4         3.0          1.3         0.2  setosa
## 9           5.0         3.5          1.3         0.3  setosa
## 10          4.5         2.3          1.3         0.3  setosa
## 11          4.4         3.2          1.3         0.2  setosa
## 12          5.1         3.5          1.4         0.2  setosa
## 13          4.9         3.0          1.4         0.2  setosa
## 14          5.0         3.6          1.4         0.2  setosa
## 15          4.6         3.4          1.4         0.3  setosa
  • If you specify a variable of class factor or character, you will rearrange the rows to alphabetical order

  • Use desc() if you want the opposite order

arrange(iris, desc(Species), # sort z to a since species is a factor
        Sepal.Width) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1           6.0         2.2          5.0         1.5 virginica
## 2           4.9         2.5          4.5         1.7 virginica
## 3           6.7         2.5          5.8         1.8 virginica
## 4           5.7         2.5          5.0         2.0 virginica
## 5           6.3         2.5          5.0         1.9 virginica
## 6           7.7         2.6          6.9         2.3 virginica
## 7           6.1         2.6          5.6         1.4 virginica
## 8           5.8         2.7          5.1         1.9 virginica
## 9           6.4         2.7          5.3         1.9 virginica
## 10          6.3         2.7          4.9         1.8 virginica
## 11          5.8         2.7          5.1         1.9 virginica
## 12          5.8         2.8          5.1         2.4 virginica
## 13          5.6         2.8          4.9         2.0 virginica
## 14          7.7         2.8          6.7         2.0 virginica
## 15          6.2         2.8          4.8         1.8 virginica

mutate()

  • Add a new variable (while preserving all existing variables)
mutate(iris, logSepLength = log(Sepal.Length)) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species logSepLength
## 1           5.1         3.5          1.4         0.2  setosa     1.629241
## 2           4.9         3.0          1.4         0.2  setosa     1.589235
## 3           4.7         3.2          1.3         0.2  setosa     1.547563
## 4           4.6         3.1          1.5         0.2  setosa     1.526056
## 5           5.0         3.6          1.4         0.2  setosa     1.609438
## 6           5.4         3.9          1.7         0.4  setosa     1.686399
## 7           4.6         3.4          1.4         0.3  setosa     1.526056
## 8           5.0         3.4          1.5         0.2  setosa     1.609438
## 9           4.4         2.9          1.4         0.2  setosa     1.481605
## 10          4.9         3.1          1.5         0.1  setosa     1.589235
## 11          5.4         3.7          1.5         0.2  setosa     1.686399
## 12          4.8         3.4          1.6         0.2  setosa     1.568616
## 13          4.8         3.0          1.4         0.1  setosa     1.568616
## 14          4.3         3.0          1.1         0.1  setosa     1.458615
## 15          5.8         4.0          1.2         0.2  setosa     1.757858
  • There’s also transmute() which deletes the old column(s) you use to make the new column
iris %>% transmute(Length_diff = Sepal.Length - Petal.Length,
                   Width_diff = Sepal.Width - Petal.Width) %>% head(n=15)
##    Length_diff Width_diff
## 1          3.7        3.3
## 2          3.5        2.8
## 3          3.4        3.0
## 4          3.1        2.9
## 5          3.6        3.4
## 6          3.7        3.5
## 7          3.2        3.1
## 8          3.5        3.2
## 9          3.0        2.7
## 10         3.4        3.0
## 11         3.9        3.5
## 12         3.2        3.2
## 13         3.4        2.9
## 14         3.2        2.9
## 15         4.6        3.8

filter()

  • Return rows matching specified conditions

  • Use with >, >=, <, <=, |, !, %in%, ==, and !=. Separating conditions by , represents the & argument.

iris %>% filter(Sepal.Length >= 2) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
iris %>% filter(Petal.Length >= mean(Petal.Length)) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1           7.0         3.2          4.7         1.4 versicolor
## 2           6.4         3.2          4.5         1.5 versicolor
## 3           6.9         3.1          4.9         1.5 versicolor
## 4           5.5         2.3          4.0         1.3 versicolor
## 5           6.5         2.8          4.6         1.5 versicolor
## 6           5.7         2.8          4.5         1.3 versicolor
## 7           6.3         3.3          4.7         1.6 versicolor
## 8           6.6         2.9          4.6         1.3 versicolor
## 9           5.2         2.7          3.9         1.4 versicolor
## 10          5.9         3.0          4.2         1.5 versicolor
## 11          6.0         2.2          4.0         1.0 versicolor
## 12          6.1         2.9          4.7         1.4 versicolor
## 13          6.7         3.1          4.4         1.4 versicolor
## 14          5.6         3.0          4.5         1.5 versicolor
## 15          5.8         2.7          4.1         1.0 versicolor
iris %>% filter(Species != "versicolor") %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
iris %>% filter(Species %in% c("versicolor", "setosa")) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa

select()

  • Keep only specified variables
select(iris, Sepal.Length, Sepal.Width) %>% head(n=15)
##    Sepal.Length Sepal.Width
## 1           5.1         3.5
## 2           4.9         3.0
## 3           4.7         3.2
## 4           4.6         3.1
## 5           5.0         3.6
## 6           5.4         3.9
## 7           4.6         3.4
## 8           5.0         3.4
## 9           4.4         2.9
## 10          4.9         3.1
## 11          5.4         3.7
## 12          4.8         3.4
## 13          4.8         3.0
## 14          4.3         3.0
## 15          5.8         4.0
  • Specify variables to exclude with -
select(iris, -Species) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           5.1         3.5          1.4         0.2
## 2           4.9         3.0          1.4         0.2
## 3           4.7         3.2          1.3         0.2
## 4           4.6         3.1          1.5         0.2
## 5           5.0         3.6          1.4         0.2
## 6           5.4         3.9          1.7         0.4
## 7           4.6         3.4          1.4         0.3
## 8           5.0         3.4          1.5         0.2
## 9           4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1
## 11          5.4         3.7          1.5         0.2
## 12          4.8         3.4          1.6         0.2
## 13          4.8         3.0          1.4         0.1
## 14          4.3         3.0          1.1         0.1
## 15          5.8         4.0          1.2         0.2
  • Select a range of variables with :
select(iris, Sepal.Length:Petal.Length) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length
## 1           5.1         3.5          1.4
## 2           4.9         3.0          1.4
## 3           4.7         3.2          1.3
## 4           4.6         3.1          1.5
## 5           5.0         3.6          1.4
## 6           5.4         3.9          1.7
## 7           4.6         3.4          1.4
## 8           5.0         3.4          1.5
## 9           4.4         2.9          1.4
## 10          4.9         3.1          1.5
## 11          5.4         3.7          1.5
## 12          4.8         3.4          1.6
## 13          4.8         3.0          1.4
## 14          4.3         3.0          1.1
## 15          5.8         4.0          1.2
  • If you select just one column, you will still get a dataframe. If you need a vector, use pull()
pull(iris, Sepal.Length) %>% head(n=15)
##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8

summarise()

  • Condenses data down to one value per group
summarise(iris, mean(Petal.Length)) %>% head(n=15)
##   mean(Petal.Length)
## 1              3.758
summarise(iris, sd_pl = sd(Petal.Length), var_pl = sd(Petal.Length)^2) %>% head(n=15)
##      sd_pl   var_pl
## 1 1.765298 3.116278

group_by()

  • Invisibly groups data by specified column(s)

  • Use with other verbs to get grouped information

iris %>% group_by(Species) %>% head(n=15)
## # A tibble: 15 x 5
## # Groups:   Species [1]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## 11          5.4         3.7          1.5         0.2 setosa 
## 12          4.8         3.4          1.6         0.2 setosa 
## 13          4.8         3            1.4         0.1 setosa 
## 14          4.3         3            1.1         0.1 setosa 
## 15          5.8         4            1.2         0.2 setosa
iris %>%
  group_by(Species) %>%
  summarise(mean(Petal.Length), sd(Petal.Length)) %>% head(n=15)
## # A tibble: 3 x 3
##   Species    `mean(Petal.Length)` `sd(Petal.Length)`
##   <fct>                     <dbl>              <dbl>
## 1 setosa                     1.46              0.174
## 2 versicolor                 4.26              0.470
## 3 virginica                  5.55              0.552
  • Data will remain grouped until you use ungroup()

rename()

  • Give your columns new names. Syntax is newColumn = oldColumn.
iris %>% rename(sep_len = Sepal.Length) %>% head(n=15)
##    sep_len Sepal.Width Petal.Length Petal.Width Species
## 1      5.1         3.5          1.4         0.2  setosa
## 2      4.9         3.0          1.4         0.2  setosa
## 3      4.7         3.2          1.3         0.2  setosa
## 4      4.6         3.1          1.5         0.2  setosa
## 5      5.0         3.6          1.4         0.2  setosa
## 6      5.4         3.9          1.7         0.4  setosa
## 7      4.6         3.4          1.4         0.3  setosa
## 8      5.0         3.4          1.5         0.2  setosa
## 9      4.4         2.9          1.4         0.2  setosa
## 10     4.9         3.1          1.5         0.1  setosa
## 11     5.4         3.7          1.5         0.2  setosa
## 12     4.8         3.4          1.6         0.2  setosa
## 13     4.8         3.0          1.4         0.1  setosa
## 14     4.3         3.0          1.1         0.1  setosa
## 15     5.8         4.0          1.2         0.2  setosa

Helper functions

everything()

  • Move columns to the front of your data
select(iris, Species, everything()) %>% head(n=15)
##    Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   setosa          5.1         3.5          1.4         0.2
## 2   setosa          4.9         3.0          1.4         0.2
## 3   setosa          4.7         3.2          1.3         0.2
## 4   setosa          4.6         3.1          1.5         0.2
## 5   setosa          5.0         3.6          1.4         0.2
## 6   setosa          5.4         3.9          1.7         0.4
## 7   setosa          4.6         3.4          1.4         0.3
## 8   setosa          5.0         3.4          1.5         0.2
## 9   setosa          4.4         2.9          1.4         0.2
## 10  setosa          4.9         3.1          1.5         0.1
## 11  setosa          5.4         3.7          1.5         0.2
## 12  setosa          4.8         3.4          1.6         0.2
## 13  setosa          4.8         3.0          1.4         0.1
## 14  setosa          4.3         3.0          1.1         0.1
## 15  setosa          5.8         4.0          1.2         0.2

starts_with()

select(iris, starts_with("Petal")) %>% head(n=15)
##    Petal.Length Petal.Width
## 1           1.4         0.2
## 2           1.4         0.2
## 3           1.3         0.2
## 4           1.5         0.2
## 5           1.4         0.2
## 6           1.7         0.4
## 7           1.4         0.3
## 8           1.5         0.2
## 9           1.4         0.2
## 10          1.5         0.1
## 11          1.5         0.2
## 12          1.6         0.2
## 13          1.4         0.1
## 14          1.1         0.1
## 15          1.2         0.2

ends_with()

select(iris, ends_with("Length")) %>% head(n=15)
##    Sepal.Length Petal.Length
## 1           5.1          1.4
## 2           4.9          1.4
## 3           4.7          1.3
## 4           4.6          1.5
## 5           5.0          1.4
## 6           5.4          1.7
## 7           4.6          1.4
## 8           5.0          1.5
## 9           4.4          1.4
## 10          4.9          1.5
## 11          5.4          1.5
## 12          4.8          1.6
## 13          4.8          1.4
## 14          4.3          1.1
## 15          5.8          1.2

contains()

  • Searches column names for a specified string
select(iris, contains("Wid")) %>% head(n=15)
##    Sepal.Width Petal.Width
## 1          3.5         0.2
## 2          3.0         0.2
## 3          3.2         0.2
## 4          3.1         0.2
## 5          3.6         0.2
## 6          3.9         0.4
## 7          3.4         0.3
## 8          3.4         0.2
## 9          2.9         0.2
## 10         3.1         0.1
## 11         3.7         0.2
## 12         3.4         0.2
## 13         3.0         0.1
## 14         3.0         0.1
## 15         4.0         0.2

matches()

  • Searches column names for a specified regular expression
select(iris, matches("wid|spec")) %>% head(n=15)
##    Sepal.Width Petal.Width Species
## 1          3.5         0.2  setosa
## 2          3.0         0.2  setosa
## 3          3.2         0.2  setosa
## 4          3.1         0.2  setosa
## 5          3.6         0.2  setosa
## 6          3.9         0.4  setosa
## 7          3.4         0.3  setosa
## 8          3.4         0.2  setosa
## 9          2.9         0.2  setosa
## 10         3.1         0.1  setosa
## 11         3.7         0.2  setosa
## 12         3.4         0.2  setosa
## 13         3.0         0.1  setosa
## 14         3.0         0.1  setosa
## 15         4.0         0.2  setosa

For more info on regex see here

row_number()

  • Specify which row number you want for your verb
iris %>%
  group_by(Species) %>%
  filter(row_number() == 1) %>% head(n=15)
## # A tibble: 3 x 5
## # Groups:   Species [3]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          5.1         3.5          1.4         0.2 setosa    
## 2          7           3.2          4.7         1.4 versicolor
## 3          6.3         3.3          6           2.5 virginica

n()

  • References the number of rows in your data frame (or for each group in a 'grouped_df')
iris %>%
  group_by(Species) %>%
  filter(row_number() == n()) %>% head(n=15)
## # A tibble: 3 x 5
## # Groups:   Species [3]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
## 1          5           3.3          1.4         0.2 setosa    
## 2          5.7         2.8          4.1         1.3 versicolor
## 3          5.9         3            5.1         1.8 virginica

Fancy verbs

  • Scoped verbs take the additional arguments .predicate, .funs, and .vars

  • They end in _at(), _if() and _all()

  • Signify what function (.funs) should be applied too all variables, only at certain variables (.vars), or only if variables meet a certain condition (.predicate)

  • ie. mutate_at(), mutate_if(), mutate_all(), summarise_if(), summarise_at(), summarise_all(), select_if(), select_at(), rename_if(), filter_all(), arrange_all(), group_by_at() …the list goes on!

  • A few examples (we will see more later)

iris %>% select_if(.predicate = is.numeric) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1           5.1         3.5          1.4         0.2
## 2           4.9         3.0          1.4         0.2
## 3           4.7         3.2          1.3         0.2
## 4           4.6         3.1          1.5         0.2
## 5           5.0         3.6          1.4         0.2
## 6           5.4         3.9          1.7         0.4
## 7           4.6         3.4          1.4         0.3
## 8           5.0         3.4          1.5         0.2
## 9           4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1
## 11          5.4         3.7          1.5         0.2
## 12          4.8         3.4          1.6         0.2
## 13          4.8         3.0          1.4         0.1
## 14          4.3         3.0          1.1         0.1
## 15          5.8         4.0          1.2         0.2
iris %>% select_if(.predicate = is.numeric, .funs=funs(paste0("num_",.))) %>% head(n=15) # can also be used to rename
##    num_Sepal.Length num_Sepal.Width num_Petal.Length num_Petal.Width
## 1               5.1             3.5              1.4             0.2
## 2               4.9             3.0              1.4             0.2
## 3               4.7             3.2              1.3             0.2
## 4               4.6             3.1              1.5             0.2
## 5               5.0             3.6              1.4             0.2
## 6               5.4             3.9              1.7             0.4
## 7               4.6             3.4              1.4             0.3
## 8               5.0             3.4              1.5             0.2
## 9               4.4             2.9              1.4             0.2
## 10              4.9             3.1              1.5             0.1
## 11              5.4             3.7              1.5             0.2
## 12              4.8             3.4              1.6             0.2
## 13              4.8             3.0              1.4             0.1
## 14              4.3             3.0              1.1             0.1
## 15              5.8             4.0              1.2             0.2
iris %>% summarise_if(.predicate = is.numeric, .funs = funs(mean(., na.rm=T))) %>% head(n=15)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.843333    3.057333        3.758    1.199333
iris %>% mutate_at(.vars = vars(c("Sepal.Length","Petal.Length")),
                                         .funs = funs(scale)) %>% head(n=15)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1   -0.89767388         3.5    -1.335752         0.2  setosa
## 2   -1.13920048         3.0    -1.335752         0.2  setosa
## 3   -1.38072709         3.2    -1.392399         0.2  setosa
## 4   -1.50149039         3.1    -1.279104         0.2  setosa
## 5   -1.01843718         3.6    -1.335752         0.2  setosa
## 6   -0.53538397         3.9    -1.165809         0.4  setosa
## 7   -1.50149039         3.4    -1.335752         0.3  setosa
## 8   -1.01843718         3.4    -1.279104         0.2  setosa
## 9   -1.74301699         2.9    -1.335752         0.2  setosa
## 10  -1.13920048         3.1    -1.279104         0.1  setosa
## 11  -0.53538397         3.7    -1.279104         0.2  setosa
## 12  -1.25996379         3.4    -1.222456         0.2  setosa
## 13  -1.25996379         3.0    -1.335752         0.1  setosa
## 14  -1.86378030         3.0    -1.505695         0.1  setosa
## 15  -0.05233076         4.0    -1.449047         0.2  setosa
# z scores by Species for all numeric variables
iris %>% group_by(Species) %>%
  mutate_if(.predicate = is.numeric, .funs = funs(scale)) %>% head(n=15)
## # A tibble: 15 x 5
## # Groups:   Species [1]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1       0.267       0.190        -0.357      -0.436 setosa 
##  2      -0.301      -1.13         -0.357      -0.436 setosa 
##  3      -0.868      -0.601        -0.933      -0.436 setosa 
##  4      -1.15       -0.865         0.219      -0.436 setosa 
##  5      -0.0170      0.454        -0.357      -0.436 setosa 
##  6       1.12        1.25          1.37        1.46  setosa 
##  7      -1.15       -0.0739       -0.357       0.512 setosa 
##  8      -0.0170     -0.0739        0.219      -0.436 setosa 
##  9      -1.72       -1.39         -0.357      -0.436 setosa 
## 10      -0.301      -0.865         0.219      -1.39  setosa 
## 11       1.12        0.718         0.219      -0.436 setosa 
## 12      -0.584      -0.0739        0.795      -0.436 setosa 
## 13      -0.584      -1.13         -0.357      -1.39  setosa 
## 14      -2.00       -1.13         -2.08       -1.39  setosa 
## 15       2.25        1.51         -1.51       -0.436 setosa
  • I have never used any of the group_by_*() but I imagine they’re useful when you have a large selection of identifiers/grouping variables that you can call easily with a predicate (for example, is.factor). They can also be used with a .funs argument as a shortcut to group_by() %>% mutate().

Miscellaneous verbs

lag()

  • Makes a new column with the value of one row previously

lead()

  • Makes a new column with the value of one row ahead
x <- runif(5)
cbind(ahead = lead(x), x, behind = lag(x))
##           ahead          x    behind
## [1,] 0.47072787 0.78080563        NA
## [2,] 0.64402818 0.47072787 0.7808056
## [3,] 0.32345814 0.64402818 0.4707279
## [4,] 0.08515902 0.32345814 0.6440282
## [5,]         NA 0.08515902 0.3234581
  • Can choose a time column to order your new values by with the argument 'order_by'
  • Lag and lead are very useful for longitudinal models

complete()

incomplete_df <- data.frame(day = c(1,3,7,9), dose = c(0, 25, 40, 30)) %>% head(n=15)
incomplete_df
##   day dose
## 1   1    0
## 2   3   25
## 3   7   40
## 4   9   30
complete(incomplete_df, day = full_seq(1:max(day), 1)) %>% head(n=15)
## # A tibble: 9 x 2
##     day  dose
##   <dbl> <dbl>
## 1     1     0
## 2     2    NA
## 3     3    25
## 4     4    NA
## 5     5    NA
## 6     6    NA
## 7     7    40
## 8     8    NA
## 9     9    30

fill()

  • Fill in missing values with values before or after
incomplete_df %>%
  complete(day = full_seq(1:max(day), 1)) %>%
  fill(dose, .direction = "down") %>% head(n=15)
## # A tibble: 9 x 2
##     day  dose
##   <dbl> <dbl>
## 1     1     0
## 2     2     0
## 3     3    25
## 4     4    25
## 5     5    25
## 6     6    25
## 7     7    40
## 8     8    40
## 9     9    30

drop_na()

incomplete_df %>%
  complete(day = full_seq(1:max(day), 1)) %>%
  drop_na() %>% head(n=15) # when you have more variables, specify which columns you care about dropping NAs from
## # A tibble: 4 x 2
##     day  dose
##   <dbl> <dbl>
## 1     1     0
## 2     3    25
## 3     7    40
## 4     9    30

sample_frac()

  • Randomly sample a specified fraction of rows of a data frame
mydat <- data.frame(id = sample(1:100, 20), meas = rnorm(20))
mydat
##     id         meas
## 1   57  2.288186793
## 2  100 -1.602300385
## 3   46  0.609128605
## 4   85 -0.481366290
## 5   71 -0.333337018
## 6   99  0.551787362
## 7    5 -0.042839936
## 8   75 -0.803681418
## 9   67  0.904749782
## 10  23  0.333527940
## 11  77  0.040703715
## 12  33  0.002205167
## 13  37  0.943927491
## 14  41  1.027059802
## 15  64  1.005199118
## 16  62  0.591302764
## 17   3  1.262050643
## 18   7 -0.644000308
## 19  97 -0.768234573
## 20  45  0.698145334
mydat %>% sample_frac(size = .5)
##     id       meas
## 1   71 -0.3333370
## 2    3  1.2620506
## 3   75 -0.8036814
## 4   85 -0.4813663
## 5   45  0.6981453
## 6   64  1.0051991
## 7  100 -1.6023004
## 8   62  0.5913028
## 9   37  0.9439275
## 10  41  1.0270598
mydat %>% sample_frac(size= .5, replace = T) # you can also add sampling weights
##    id         meas
## 1  62  0.591302764
## 2  75 -0.803681418
## 3  33  0.002205167
## 4  97 -0.768234573
## 5  23  0.333527940
## 6  23  0.333527940
## 7  67  0.904749782
## 8  62  0.591302764
## 9  85 -0.481366290
## 10 75 -0.803681418

sample_n()

  • Randomly sample a specified number of rows of a data frame
mydat %>% sample_n(3)
##   id      meas
## 1 62 0.5913028
## 2 37 0.9439275
## 3 41 1.0270598

Joining functions

x <- data.frame(id = c("A","B","C"), val = 1:3)
y <- data.frame(id = c("A","B","D"), val = c("T","F","T"))
x
##   id val
## 1  A   1
## 2  B   2
## 3  C   3
y
##   id val
## 1  A   T
## 2  B   F
## 3  D   T
  • “Mutating” joins combine variables from the left and right hand sides ie. full_join(), inner_join(), right_join(), and left_join()

full_join()

  • Return all rows and columns
full_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  C     3  <NA>
## 4  D    NA     T

inner_join()

  • Return all rows from x that have a match in y, and all columns from x and y
inner_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F

left_join()

  • Return all rows from x and all columns from x and y
left_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  C     3  <NA>

right_join()

  • Return all rows from x and all columns from x and y
right_join(x, y, by = "id")
##   id val.x val.y
## 1  A     1     T
## 2  B     2     F
## 3  D    NA     T
  • “Filtering” joins keep cases from the left hand side, ie. semi_join() and anti_join()

semi_join()

  • Return all rows in x that have a match in y, keeping only columns from y
semi_join(x, y, by = "id")
##   id val
## 1  A   1
## 2  B   2

anti_join()

  • Return all rows from x where there are not matching values in y, keeping just the columns from x
anti_join(x, y, by = "id")
##   id val
## 1  C   3

Incorporating dplyr into your workflow

Frequency Tables

library(kableExtra)
mtcars %>%
  group_by(vs, am) %>%
  summarise(n=n(), freq=n()/nrow(.)) %>%
  kable(caption="Frequency Table of vs and am", format="html") %>%
  kable_styling(c("condensed","responsive","striped","hover"), full_width = F)
Table 1: Frequency Table of vs and am
vs am n freq
0 0 12 0.37500
0 1 6 0.18750
1 0 7 0.21875
1 1 7 0.21875

compareGroups

library(compareGroups)
iris %>%
  mutate(Sepal.Length.Sq = Sepal.Length^2) %>%
  compareGroups(Species ~ ., data = .) %>%
  createTable() %>%
  export2md() %>%
  kable_styling(c("condensed","responsive","striped","hover"))
Table 2: Summary descriptives table by groups of `Species’
setosa versicolor virginica p.overall
N=50 N=50 N=50
Sepal.Length 5.01 (0.35) 5.94 (0.52) 6.59 (0.64) <0.001
Sepal.Width 3.43 (0.38) 2.77 (0.31) 2.97 (0.32) <0.001
Petal.Length 1.46 (0.17) 4.26 (0.47) 5.55 (0.55) <0.001
Petal.Width 0.25 (0.11) 1.33 (0.20) 2.03 (0.27) <0.001
Sepal.Length.Sq 25.2 (3.55) 35.5 (6.16) 43.8 (8.44) <0.001

ggplotting

iris %>%
  filter(Species != "setosa") %>%
  ggplot(aes(Sepal.Length, Petal.Width, col=Species)) +
  geom_point()

Finding columns of interest

  • When you have too many column names to look through manually, search for a string or pattern of strings
load("dplyr_dat.Rdata")
length(names(brain))
## [1] 1000
select(brain, matches("pib.*parietal.*")) %>% names() # regex is not case sensitive
## [1] "PIB.PET_AAL__Parietal_Inf_L"        
## [2] "PIB.PET_AAL__Parietal_Inf_R"        
## [3] "PIB.PET_AAL__Parietal_Sup_L"        
## [4] "PIB.PET_AAL__Parietal_Sup_R"        
## [5] "PIB.PET_FS__ctx.lh.inferiorparietal"
## [6] "PIB.PET_FS__ctx.lh.superiorparietal"
## [7] "PIB.PET_FS__ctx.rh.inferiorparietal"
## [8] "PIB.PET_FS__ctx.rh.superiorparietal"

Missing data wrangling

  • Add a column flagging a value as missing and then replace the missing values with the mean of the other values
misdat <- data.frame(x1 = sample(c(1:3, NA), 13, replace=T),
                     x2 = sample(c(-5:-2,NA), 13, replace=T))
misdat
##    x1 x2
## 1   3 -3
## 2   3 -5
## 3   3 -2
## 4  NA NA
## 5  NA -5
## 6  NA -2
## 7   1 -3
## 8   3 NA
## 9   1 -4
## 10 NA -3
## 11 NA -4
## 12  3 -5
## 13 NA -4
misdat %>%
  mutate_all(.funs = funs(miss = ifelse(is.na(.), 1, 0))) %>%
  mutate_all(.funs = funs(replace_na(., mean(., na.rm=T))))
##          x1        x2 x1_miss x2_miss
## 1  3.000000 -3.000000       0       0
## 2  3.000000 -5.000000       0       0
## 3  3.000000 -2.000000       0       0
## 4  2.428571 -3.636364       1       1
## 5  2.428571 -5.000000       1       0
## 6  2.428571 -2.000000       1       0
## 7  1.000000 -3.000000       0       0
## 8  3.000000 -3.636364       0       1
## 9  1.000000 -4.000000       0       0
## 10 2.428571 -3.000000       1       0
## 11 2.428571 -4.000000       1       0
## 12 3.000000 -5.000000       0       0
## 13 2.428571 -4.000000       1       0

Longitudinal data wrangling A

  • Calculate the time since a patient was first admitted to the hospital
id <- c(1,1,1,2,2,2,2,3)
admit <- as.Date(c("2017-06-22", "2017-07-13", "2017-08-29",
                   "2017-04-01", "2017-05-02", "2017-11-14", "2018-01-14",
                   "2019-01-01"))
discharge <- as.Date(c("2017-06-25", "2017-07-31", "2017-10-13",
                   "2017-04-02", "2017-05-10", "2017-11-18", "2018-02-12",
                   "2019-01-05"))
hosp_dat <- data.frame(id, admit, discharge)
hosp_dat
##   id      admit  discharge
## 1  1 2017-06-22 2017-06-25
## 2  1 2017-07-13 2017-07-31
## 3  1 2017-08-29 2017-10-13
## 4  2 2017-04-01 2017-04-02
## 5  2 2017-05-02 2017-05-10
## 6  2 2017-11-14 2017-11-18
## 7  2 2018-01-14 2018-02-12
## 8  3 2019-01-01 2019-01-05
hosp_dat %>%
  group_by(id) %>%
  mutate(days_since_init_admit = discharge - admit[1])
## # A tibble: 8 x 4
## # Groups:   id [3]
##      id admit      discharge  days_since_init_admit
##   <dbl> <date>     <date>     <drtn>               
## 1     1 2017-06-22 2017-06-25   3 days             
## 2     1 2017-07-13 2017-07-31  39 days             
## 3     1 2017-08-29 2017-10-13 113 days             
## 4     2 2017-04-01 2017-04-02   1 days             
## 5     2 2017-05-02 2017-05-10  39 days             
## 6     2 2017-11-14 2017-11-18 231 days             
## 7     2 2018-01-14 2018-02-12 317 days             
## 8     3 2019-01-01 2019-01-05   4 days

Longitudinal data wrangling B

  • Make a row for every patient for every month from the start of follow up to the end of follow up (get equally spaced time intervals)

  • Make another column containing the drugs the patient was on previously (for prediction or longitudinal models)

load("meds.Rdata")
meds
##    PatientID Months_ImplantToVisit BB_yn ACE_yn ARB_yn
## 1          1                     1     0      0      0
## 2          1                     2     0      0      0
## 3          2                     2     0      1      0
## 4          2                     3     0      1      0
## 5          2                     4     0      1      0
## 6          2                     5     0      1      0
## 7          2                     6     1      1      0
## 8          2                     8     1      1      0
## 9          2                    10     1      1      0
## 10         2                    12     1      1      0
## 11         2                    13     1      1      0
## 12         2                    14     1      1      0
## 13         2                    16     1      1      0
## 14         3                     2     0      0      0
## 15         3                     3     0      0      0
meds %>%
    group_by(PatientID) %>%
    complete(Months_ImplantToVisit = full_seq(1:max(Months_ImplantToVisit), 1)) %>%
    fill(ends_with("_yn")) %>%
    fill(ends_with("_yn"), .direction="up") -> meds1
meds1
## # A tibble: 21 x 5
## # Groups:   PatientID [3]
##    PatientID Months_ImplantToVisit BB_yn ACE_yn ARB_yn
##        <int>                 <dbl> <fct> <fct>  <fct> 
##  1         1                     1 0     0      0     
##  2         1                     2 0     0      0     
##  3         2                     1 0     1      0     
##  4         2                     2 0     1      0     
##  5         2                     3 0     1      0     
##  6         2                     4 0     1      0     
##  7         2                     5 0     1      0     
##  8         2                     6 1     1      0     
##  9         2                     7 1     1      0     
## 10         2                     8 1     1      0     
## # … with 11 more rows
meds1 %>%
  mutate_at(.vars = vars(ends_with("_yn")),
              .funs = funs(prev = lag(., order_by = Months_ImplantToVisit))) %>%
    fill(ends_with("_prev"), .direction="up") %>%
  head(n=15)
## # A tibble: 15 x 8
## # Groups:   PatientID [2]
##    PatientID Months_ImplantT… BB_yn ACE_yn ARB_yn BB_yn_prev ACE_yn_prev
##        <int>            <dbl> <fct> <fct>  <fct>  <fct>      <fct>      
##  1         1                1 0     0      0      0          0          
##  2         1                2 0     0      0      0          0          
##  3         2                1 0     1      0      0          1          
##  4         2                2 0     1      0      0          1          
##  5         2                3 0     1      0      0          1          
##  6         2                4 0     1      0      0          1          
##  7         2                5 0     1      0      0          1          
##  8         2                6 1     1      0      0          1          
##  9         2                7 1     1      0      1          1          
## 10         2                8 1     1      0      1          1          
## 11         2                9 1     1      0      1          1          
## 12         2               10 1     1      0      1          1          
## 13         2               11 1     1      0      1          1          
## 14         2               12 1     1      0      1          1          
## 15         2               13 1     1      0      1          1          
## # … with 1 more variable: ARB_yn_prev <fct>

Survival data wrangling

  • Map four columns “days to…outcome” to a composite endpoint for a survival model

  • Record when the event occurred as Days.to.first.event and which event it was in FE.status

dems %>%
  select(Days.to.lastFU, Days.to.death, Days.to.stroke, Days.to.GIB, Days.to.PT) %>%
  # which.min cannot handle NA, so we'll make NA's infinity for now
  replace(is.na(.), Inf) %>%
  # to allow which.min to search along rows (dplyr naturally looks down columns)
  rowwise() %>%
  mutate(
    FUorFEtime = pmin(Days.to.lastFU, Days.to.death, Days.to.stroke,
                      Days.to.GIB, Days.to.PT,
                      na.rm = T),
    # numbers correspond to order of the Days* columns
    FUorFEstatus = which.min(c(Days.to.lastFU, Days.to.death, Days.to.stroke,
                                    Days.to.GIB, Days.to.PT)),
    # condensed variable for survival model, 1 if any event
  Event_yn = ifelse(FUorFEstatus == 1, 0, 1)) -> dems_int

dems_int # look at the intermediate output
## Source: local data frame [265 x 8]
## Groups: <by row>
## 
## # A tibble: 265 x 8
##    Days.to.lastFU Days.to.death Days.to.stroke Days.to.GIB Days.to.PT
##             <dbl>         <dbl>          <dbl>       <dbl>      <dbl>
##  1             98            98             94         Inf        Inf
##  2            163           Inf            Inf         Inf        141
##  3            185           Inf            Inf         Inf        Inf
##  4            139           Inf            Inf         Inf        Inf
##  5            196           Inf            Inf          78        Inf
##  6            161           Inf            Inf         Inf        Inf
##  7            210           Inf            185         Inf        Inf
##  8            233           Inf            Inf         Inf        Inf
##  9            239           Inf            Inf         Inf        Inf
## 10            243           Inf            Inf         Inf        Inf
## # … with 255 more rows, and 3 more variables: FUorFEtime <dbl>,
## #   FUorFEstatus <int>, Event_yn <dbl>
dems_int %>%
  # column names correspond to the order of the columns, rename
  mutate(FUorFEstatus = case_when(FUorFEstatus == 1 ~ "censored",
                                  FUorFEstatus == 2 ~ "death",
                                  FUorFEstatus == 3 ~ "stroke",
                                  FUorFEstatus == 4 ~ "gib",
                                  FUorFEstatus == 5 ~ "pt",
                                  TRUE ~ "error")) %>%
  # not case sensitive
  select(contains("fe"), Event_yn) %>%
  # allow for joining with no duplicate cols
  rownames_to_column() %>%
  full_join(dems %>% rownames_to_column()) %>%
  select(-rowname) -> dems_clean

# check to make sure everything worked
dems_clean %>% filter(FUorFEstatus == "error")
## Source: local data frame [0 x 55]
## Groups: <by row>
## 
## # A tibble: 0 x 55
## # … with 55 variables: FUorFEtime <dbl>, FUorFEstatus <chr>,
## #   Event_yn <dbl>, PatientID <int>, Year.of.Implant <int>, Outcome <int>,
## #   Days.to.lastFU <int>, Stroke <int>, Days.to.stroke <int>,
## #   TypeStroke <int>, Stroke_IA <int>, Days.to.IA.stroke <int>,
## #   TypeStroke_IA <int>, PT <int>, Days.to.PT <int>, GIB <int>,
## #   Days.to.GIB <int>, GIB_IA <int>, Days.to.GIB_IA <int>,
## #   Age.at.implant <int>, Sex <fct>, Race <int>, Caucasian <int>,
## #   AfibFlut <fct>, Smoking_Hx <fct>, DM <fct>, Ischemic <int>,
## #   HTN_Hx <fct>, Stroke_Hx <fct>, Pulmonary_Hx <fct>, ICD <fct>,
## #   IMCS <int>, DT <int>, Hb_pre <dbl>, PLT_pre <int>, INR_dx <dbl>,
## #   Creat_dx <dbl>, Height <dbl>, Weight <dbl>, BMI <dbl>, Speed <int>,
## #   PI <dbl>, Flow <dbl>, RVAD <fct>, IABP <fct>,
## #   Days.to.first.outpt.visit <int>, Days.to.IA.discharge <int>, EF <fct>,
## #   LVEDD <fct>, RV_Dysf <fct>, LDH_Dx <int>, eGFR <dbl>, AvgMAP <lgl>,
## #   Death <int>, Days.to.death <int>
dems_clean %>%
  head(n=15)
## Source: local data frame [15 x 55]
## Groups: <by row>
## 
## # A tibble: 15 x 55
##    FUorFEtime FUorFEstatus Event_yn PatientID Year.of.Implant Outcome
##         <dbl> <chr>           <dbl>     <int>           <int>   <int>
##  1         94 stroke              1         1            2016       2
##  2        141 pt                  1         2            2016       7
##  3        185 censored            0         3            2016       7
##  4        139 censored            0         4            2016       6
##  5         78 gib                 1         5            2016       7
##  6        161 censored            0         6            2016       1
##  7        185 stroke              1         7            2016       7
##  8        233 censored            0         8            2016       7
##  9        239 censored            0         9            2016       7
## 10        243 censored            0        10            2016       7
## 11        248 censored            0        11            2016       7
## 12         51 pt                  1        12            2016       1
## 13         37 gib                 1        13            2016       7
## 14        261 censored            0        14            2016       7
## 15        265 censored            0        15            2016       7
## # … with 49 more variables: Days.to.lastFU <int>, Stroke <int>,
## #   Days.to.stroke <int>, TypeStroke <int>, Stroke_IA <int>,
## #   Days.to.IA.stroke <int>, TypeStroke_IA <int>, PT <int>,
## #   Days.to.PT <int>, GIB <int>, Days.to.GIB <int>, GIB_IA <int>,
## #   Days.to.GIB_IA <int>, Age.at.implant <int>, Sex <fct>, Race <int>,
## #   Caucasian <int>, AfibFlut <fct>, Smoking_Hx <fct>, DM <fct>,
## #   Ischemic <int>, HTN_Hx <fct>, Stroke_Hx <fct>, Pulmonary_Hx <fct>,
## #   ICD <fct>, IMCS <int>, DT <int>, Hb_pre <dbl>, PLT_pre <int>,
## #   INR_dx <dbl>, Creat_dx <dbl>, Height <dbl>, Weight <dbl>, BMI <dbl>,
## #   Speed <int>, PI <dbl>, Flow <dbl>, RVAD <fct>, IABP <fct>,
## #   Days.to.first.outpt.visit <int>, Days.to.IA.discharge <int>, EF <fct>,
## #   LVEDD <fct>, RV_Dysf <fct>, LDH_Dx <int>, eGFR <dbl>, AvgMAP <lgl>,
## #   Death <int>, Days.to.death <int>

Making functions

  • Problem: dplyr doesn’t know what to do with quotes around the variable name, but you can’t put the column name into a function without quotes because R will try to find it as an object in your environment…

  • Solution: relying on the rlang package (sym, !!, !!!, etc)
irisSummary <- function(group){
    iris %>%
      group_by(!!sym(group)) %>%
      summarise(mean(Sepal.Length), sd(Sepal.Length))
}

irisSummary(group = "Species")
## # A tibble: 3 x 3
##   Species    `mean(Sepal.Length)` `sd(Sepal.Length)`
##   <fct>                     <dbl>              <dbl>
## 1 setosa                     5.01              0.352
## 2 versicolor                 5.94              0.516
## 3 virginica                  6.59              0.636
irisScatPlot <- function(x, y){
    iris %>%
    ggplot(aes_string(x, y, col="Species")) +
    geom_point() -> p
  return(p)
}

irisScatPlot(x="Sepal.Length",y="Petal.Length")

Avatar
Katherine Hoffman
Research Biostatistician I

I am passionate about meaningful, reproducible medical research.

Related