This post walks through code to create a timeline in
R
usingggplot2
. These types of plots can help visualize treatment or measurement patterns, time-varying covariates, outcomes, and loss to follow-up in longitudinal data settings.
Skip the step-by-step, just the code, please!
Background
Treatment timelines, or “swimmer plots”, are a visualization technique I find useful in exploring longitudinal data structures. A few years ago I shared how I make treatment timelines for a single treatment (categorical or continuous) in the post Patient Treatment Timelines for Longitudinal Survival Data.
Sometimes when I share these plots with collaborators, they ask me to add additional variables to the timelines. This post shows how to do that.
I’ll use a toy dataset on hospitalized COVID-19 patients, available to download on this Github repository. It is derived from a dataset from Electronic Health Record data during Spring 2020. This is a time period when there was large variation in provider practice in administering steroids, a type of drug that combats hyper-inflammation. Steroids are usually given to patients which exhibit an inflammatory profile; we will identify this using a threshold for low oxygen levels (severe hypoxia).
We will look at the treatment patterns of steroids as it relates to the timing of patients (1) reaching severe hypoxia and (2) being put on a ventilator (intubation). We will also include whether patients died. I used a similar figure in a recent manuscript on this topic, if you’re interested in learning more!
Exploring the data
The data set is in long format with one row per patient. Let’s first load the data set and libraries we’ll need, then look at the first 20 rows:
# install.packages(c("tidyverse","gt","RCurl"))
library(tidyverse)
library(gt)
dat_long <- read_csv("https://raw.githubusercontent.com/kathoffman/steroids-trial-emulation/main/data/dat_trt_timeline.csv", col_types = list(id = "c", steroids = "c", death = "c", severe = "c"))
If we look at the first patient (id = 797), we can see they were in the hospital for 17 days, never intubated, never receive steroids, and ultimately die (death
is 1
on the last day).
head(dat_long, n=20) |>
gt()
id | day | intubation_status | steroids | death | severe |
---|---|---|---|---|---|
797 | 0 | Not intubated | 0 | 0 | 0 |
797 | 1 | Not intubated | 0 | 0 | 0 |
797 | 2 | Not intubated | 0 | 0 | 1 |
797 | 3 | Not intubated | 0 | 0 | 0 |
797 | 4 | Not intubated | 0 | 0 | 0 |
797 | 5 | Not intubated | 0 | 0 | 0 |
797 | 6 | Not intubated | 0 | 0 | 0 |
797 | 7 | Not intubated | 0 | 0 | 0 |
797 | 8 | Not intubated | 0 | 0 | 0 |
797 | 9 | Not intubated | 0 | 0 | 0 |
797 | 10 | Not intubated | 0 | 0 | 0 |
797 | 11 | Not intubated | 0 | 0 | 0 |
797 | 12 | Not intubated | 0 | 0 | 0 |
797 | 13 | Not intubated | 0 | 0 | 0 |
797 | 14 | Not intubated | 0 | 0 | 0 |
797 | 15 | Not intubated | 0 | 0 | 0 |
797 | 16 | Not intubated | 0 | 1 | 0 |
285 | 0 | Not intubated | 0 | 0 | 0 |
285 | 1 | Not intubated | 0 | 0 | 0 |
285 | 2 | Not intubated | 0 | 0 | 0 |
We can plot all patients’ hospital length of stay, colored by intubation status using ggplot2
’s geom_line()
:
dat_long |>
ggplot(aes(x=day, y=id, col = intubation_status, group=id)) +
geom_line() +
theme_bw()
We can add our steroids
column to the plot by adding a point designating whether steroids
exposure was 1
(yes) or 0
(no) that day. We can see this results in points of two different colors on the lines of our plot. This can work just fine! …unless you want to add another variable to the timeline.
dat_long |>
ggplot(aes(x=day, y=id, col = intubation_status, group=id)) +
geom_point(aes(day, id, col = steroids)) +
geom_line() +
theme_bw()
Modify the data
We could edit the colors of the dots we don’t want so that they’re transparent (using NA
), but when you have other non-mutually exclusive dots you want to show, it’s simpler to just edit the data instead. So, we will now edit our data so that our three binary columns are turned into three *_this_day
column, where:
The value is
NA
if the observation did not experience that exposure/outcome that day (remember each day is a new row)The value is the
day
if the observation did experience the exposure/outcome. This is to make ourx
axis easy to specify inggplot2
.
dat_swim <-
dat_long |>
mutate(severe_this_day = case_when(severe == 1 ~ day),
steroids_this_day = case_when(steroids == 1 ~ day),
death_this_day = case_when(death == 1 ~ day))
While we’re at it, let’s modify the patient’s IDs so that we can rearrange our plot by length of each individual’s timeline. To do this, we will refactor the id
variable by a new variable id_sorted
.
There are probably other ways to do this, but this is what I do. I first make a new variable called max_day
that is the maximum day that the patient is in the study. I then nest()
the data by id
(by calling all the other columns).
I then make my new column using row_number()
, and unnest()
the data, which gives me clean IDs in a sorted fashion:
dat_swim <-
dat_swim |>
group_by(id) |>
mutate(max_day = max(day)) |>
ungroup() |>
nest(cols = day:death_this_day) |>
arrange(max_day) |>
mutate(id_sorted = factor(row_number())) |>
unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(cols)`
head(dat_swim) |> gt()
id | max_day | day | intubation_status | steroids | death | severe | severe_this_day | steroids_this_day | death_this_day | id_sorted |
---|---|---|---|---|---|---|---|---|---|---|
767 | 1 | 0 | Not intubated | 0 | 0 | 0 | NA | NA | NA | 1 |
767 | 1 | 1 | Not intubated | 0 | 0 | 0 | NA | NA | NA | 1 |
156 | 1 | 0 | Not intubated | 0 | 0 | 0 | NA | NA | NA | 2 |
156 | 1 | 1 | Not intubated | 0 | 1 | 0 | NA | NA | 1 | 2 |
804 | 2 | 0 | Not intubated | 0 | 0 | 0 | NA | NA | NA | 3 |
804 | 2 | 1 | Not intubated | 0 | 0 | 0 | NA | NA | NA | 3 |
Back to plotting
Now, we can re-plot the steroids and intubation statuses using our new data. This time without all the 0
values for steroids showing.
dat_swim |>
ggplot() +
geom_line(aes(x=day, y=id_sorted, col = intubation_status, group=id)) +
geom_point(aes(x=steroids_this_day, y=id_sorted, col="Steroids", shape="Steroids")) +
theme_bw()
## Warning: Removed 387 rows containing missing values (geom_point).
From this point on I’ll save the plot as p
and just keep adding onto it so you can see the new step.
We’ll see why we’re doing this in a second, but in creating this first iteration of p
using geom_line()
and geom_point()
, we also want to set the col
and shape
to match how we want the marker for steroids to appear in the legend. I’m also going to make minor edits to the size
of the point geometry right now, as well as the width of each timeline itself (also using the size
argument). Lastly, we are now going to use id_sorted
.
p <-
dat_swim |>
ggplot() +
geom_line(aes(x=day, y=id_sorted, col = intubation_status, group=id_sorted),
size=1.8) +
geom_point(aes(x=steroids_this_day, y=id_sorted, col="Steroids", shape="Steroids"), stroke=2) +
theme_bw()
p
## Warning: Removed 387 rows containing missing values (geom_point).
Let’s add hypoxia and death to the figure. We’ll use geom_point()
again, and again specify legend names for the col
and shape
arguments, and modify the size
and stroke
of our point geometries.
p <- p +
geom_point(aes(x=severe_this_day, y=id_sorted, col="Severe hypoxia", shape="Severe hypoxia"), size=2, stroke=1.5) +
geom_point(aes(x=death_this_day, y=id_sorted, col="Death", shape="Death"), size=2, stroke=1.5)
p
## Warning: Removed 387 rows containing missing values (geom_point).
## Warning: Removed 403 rows containing missing values (geom_point).
## Warning: Removed 414 rows containing missing values (geom_point).
Note that we get warning messages that values with NA
are removed. This is fine since we just created all those NA
s! I’m going to set my options so that warnings are suppressed for future code outputs to keep this post tidy.
knitr::opts_chunk$set(message=F, warning=F)
Modify the colors and shapes
Next let’s start changing our color and shape scales. We can change colors using scale_color_manual()
and and filling in the values
argument with a vector where the names of the vector match the names in the col
in our geom_point()
aesthetics.
I define my cols
in a vector outside the plotting code to keep everything cleaner. Note that the order we’re specifying here will continue throughout the rest of the plotting code!
# define colors for all geometries with a color argument
cols <- c("Severe hypoxia" = "#b24745", # red
"Intubated" = "darkslateblue", # navy
"Not intubated" = "#74aaff", # lighter blue
"Steroids"="#ffd966", # gold
"Death" = "black")
After we set values = cols
, the name
argument is simply the title we want for our legend (I chose “Patient Status”).
p <- p +
scale_color_manual(values = cols, name="Patient Status")
p
We can do the same to change shapes with scale_shape_manual()
. If you don’t want a shape for a certain status (e.g. intubation status), just leave it out of your vector.
# define shapes for all geometries with a shape argument (geom_point)
shapes <- c("Severe hypoxia" = 21,
"Steroids" = 15, # square
"Death" = 4) # cross # empty circle (control inside with fill argument if desired)
p <- p + scale_shape_manual(values = shapes, name = "Patient Status")
p
Fix the Legend
You’ll notice that our legend does not match the changes we made to the colors and shapes right now. This is because we’re stretching (possibly breaking) the grammar of graphics (the “gg” of ggplot2
) rules by assigning different colors and shapes across variables. Although ggplot2
wasn’t designed to do what we’re doing, we can override the legend aesthetics and still create a plot that shows correct and useful information.
We will do this by using the guides()
function. We can control each aesthetic here. We will first override the colors
legend with the code guide_legend(override.aes = list(...))
.
This allows us to change the shapes of the color legend by specifying a vector with the shapes we want in the order the labels appear in the legend. If we don’t want a shape to appear on the legend, we will use NA
.
I only want to show shapes for certain statuses (severe hypoxia, steroid administration, death), and not the intubation status of a patient, so I’ll set up my shape override vector accordingly. Note that the order in the legend follows the order of my color specification vector (cols
).
shape_override <- c(21, NA, NA, 15, 4) # order matches `cols`:severe, intubation (yes/no), steroids, death
p +
guides(color = guide_legend(
override.aes = list(
shape = shape_override) # modify the color legend to include shapes
)
)
To remove the line through Death, Severe hypoxia, and Steroids in our legend, we can override the aesthetics for linetype
with NA’s for those three labels. We will specify the default, linetype=1
, for our intubation status color labels.
We can additionally override the stroke and size arguments to correspond to our point geometries.
line_override <- c(NA,1,1,NA,NA) # order matches `cols`:severe, intubation (yes/no), steroids, death
stroke_override <- c(.8,1,1,1,1) # order matches `cols`:severe, intubation (yes/no), steroids, death
size_override <- c(2.5,2.5,2.6,2,2) # order matches `cols`:severe, intubation (yes/no), steroids, death
p + guides(color = guide_legend(
override.aes = list(
shape = shape_override,
linetype = line_override)
)
)
Now that our color legend contains all necessary information, we can remove the shape legend using the argument shape = "none"
in guides()
.
p <- p +
guides(color = guide_legend(
override.aes = list(
shape = shape_override,
linetype = line_override)
),
shape = "none"
)
p
Ok, the challenging parts are done! Now we can make some minor aesthetic edits using labs
, scale_x_continuous()
, and theme()
. I won’t go into detail on these edits because they’re fairly self-explanatory, but check out the help files if you’re unsure what these arguments in theme do!
p <- p +
labs(x="Days since hospitalization",y="Patient\nnumber",title="Treatment Timeline for N=30 Patients") +
scale_x_continuous(expand=c(0,0)) + # remove extra white space
theme(text=element_text(family="Poppins", size=11),
title = element_text(angle = 0, vjust=.5, size=12, face="bold"),
axis.title.y = element_text(angle = 0, vjust=.5, size=12, face="bold"),
axis.title.x = element_text(size=15, face="bold", vjust=-0.5, hjust=0),
axis.text.y = element_text(size=6, hjust=1.5),
axis.ticks.y = element_blank(),
legend.position = c(0.8, 0.3),
legend.title = element_text(colour="black", size=13, face=4),
legend.text = element_text(colour="black", size=10),
legend.background = element_rect(size=0.5, linetype="solid", colour ="gray30"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
p
Hope this is helpful! As always let me know if you have any feedback or suggestions. If you’d like to copy-paste the code, here it is:
Just the Code
library(tidyverse)
library(gt)
dat_long <- read_csv("https://raw.githubusercontent.com/kathoffman/steroids-trial-emulation/main/data/dat_trt_timeline.csv", col_types = list(id = "c", steroids = "c", death = "c", severe = "c"))
# define colors for all geometries with a color argument
cols <- c("Severe hypoxia" = "#b24745", # red
"Intubated" = "darkslateblue", # navy
"Not intubated" = "#74aaff", # lighter blue
"Steroids"="#ffd966", # gold
"Death" = "black")
shapes <- c("Severe hypoxia" = 21,
"Steroids" = 15, # square
"Death" = 4) # cross # empty circle (control inside with fill argument if desired)
shape_override <- c(21, NA, NA, 15, 4) # order matches `cols`:severe, intubation (yes/no), steroids, death
line_override <- c(NA,1,1,NA,NA) # order matches `cols`:severe, intubation (yes/no), steroids, death
stroke_override <- c(.8,1,1,1,1) # order matches `cols`:severe, intubation (yes/no), steroids, death
size_override <- c(2.5,2.5,2.6,2,2) # order matches `cols`:severe, intubation (yes/no), steroids, death
# modify swimmer data to 1) only show events if yes 2) have an id ordered by max follow up
dat_swim <-
dat_long |>
mutate(severe_this_day = case_when(severe == 1 ~ day),
steroids_this_day = case_when(steroids == 1 ~ day),
death_this_day = case_when(death == 1 ~ day)) %>%
group_by(id) |>
mutate(max_day = max(day)) |>
ungroup() |>
nest(cols = day:death_this_day) |>
arrange(max_day) |>
mutate(id_sorted = factor(row_number())) |>
unnest()
dat_swim |>
ggplot() +
geom_line(aes(x=day, y=id_sorted, col = intubation_status, group=id_sorted),
size=1.8) +
geom_point(aes(x=steroids_this_day, y=id_sorted, col="Steroids", shape="Steroids"), stroke=2) +
geom_point(aes(x=severe_this_day, y=id_sorted, col="Severe hypoxia", shape="Severe hypoxia"), size=2, stroke=1.5) +
geom_point(aes(x=death_this_day, y=id_sorted, col="Death", shape="Death"), size=2, stroke=1.5) +
theme_bw() +
scale_color_manual(values = cols, name="Patient Status") +
scale_shape_manual(values = shapes, name = "Patient Status") +
guides(color = guide_legend(
override.aes = list(
shape = shape_override,
linetype = line_override)
),
shape = "none"
)+
labs(x="Days since hospitalization",y="Patient\nnumber",title="Treatment Timeline for N=30 Patients") +
scale_x_continuous(expand=c(0,0)) + # remove extra white space
theme(text=element_text(family="Poppins", size=11),
title = element_text(angle = 0, vjust=.5, size=12, face="bold"),
axis.title.y = element_text(angle = 0, vjust=.5, size=12, face="bold"),
axis.title.x = element_text(size=15, face="bold", vjust=-0.5, hjust=0),
axis.text.y = element_text(size=6, hjust=1.5),
axis.ticks.y = element_blank(),
legend.position = c(0.8, 0.3),
legend.title = element_text(colour="black", size=13, face=4),
legend.text = element_text(colour="black", size=10),
legend.background = element_rect(size=0.5, linetype="solid", colour ="gray30"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank()
)
Session Info
sessionInfo()
R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Catalina 10.15.7
Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gt_0.6.0 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
[5] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6
[9] ggplot2_3.3.5 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] lubridate_1.8.0 here_1.0.1 assertthat_0.2.1 rprojroot_2.0.3
[5] digest_0.6.29 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0
[9] backports_1.4.1 reprex_2.0.1 evaluate_0.15 highr_0.9
[13] httr_1.4.2 blogdown_1.9 pillar_1.7.0 rlang_1.0.2
[17] curl_4.3.2 readxl_1.4.0 rstudioapi_0.13 jquerylib_0.1.4
[21] checkmate_2.0.0 rmarkdown_2.13 labeling_0.4.2 bit_4.0.4
[25] munsell_0.5.0 broom_0.8.0 compiler_4.1.3 modelr_0.1.8
[29] xfun_0.30 pkgconfig_2.0.3 htmltools_0.5.2 tidyselect_1.1.2
[33] bookdown_0.26 fansi_1.0.3 crayon_1.5.1 tzdb_0.3.0
[37] dbplyr_2.1.1 withr_2.5.0 cabinets_0.6.0 grid_4.1.3
[41] jsonlite_1.8.0 gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.2
[45] magrittr_2.0.3 scales_1.2.0 cli_3.3.0 stringi_1.7.6
[49] vroom_1.5.7 farver_2.1.0 fs_1.5.2 xml2_1.3.3
[53] bslib_0.3.1 ellipsis_0.3.2 generics_0.1.2 vctrs_0.4.1
[57] tools_4.1.3 bit64_4.0.5 glue_1.6.2 hms_1.1.1
[61] parallel_4.1.3 fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3
[65] rvest_1.0.2 knitr_1.38 haven_2.5.0 sass_0.4.1