A ggplot code walkthrough for making treatment timelines or “swimmer plots” for longitudinal time-to-event data on 1) a categorical covariate and 2) a continuous covariate organized by missingness frequency.
November 3, 2019.
I am a biostatistician at a research university, and I often find myself working with longitudinal survival data. As with any data analysis, I need to examine the quality of my data before deciding which statistical methods to implement.
This post contains reproducible examples for how I prefer to visually explore survival data containing longitudinal exposures or covariates. I create a “treatment timeline” for each patient, and the end product looks something like this:
Each line represents one patient, and each square represents a measurement at a specified point. I find these graphs useful to look for patterns or discrepancies in follow up times, treatment, and missingness. They also allow me to verify that my coding is correct as I move through different data manipulations for my analyses.
For the following examples I generate longitudinal survival data sets. I skip over how I made this data because it’s not the focus of this post, but if you have questions about the data generation please let me know. In general, I am trying to work from the point in a survival data analysis where you’ve already calculated each individual’s time to event.
Example 1: Binary Treatment
To make these charts, my first step is to load the package tidyverse
, since I use functions from dplyr
, tidyr
, forcats
, and ggplot2
.
library(tidyverse)
Here is the data generation code. If you’re interested in how I made it, you can check the comments, but if not, I recommend just copying and pasting into your R
console to try the plotting code.
set.seed(7)
<- 50 # The data sets I make these visualizations for are typically 100-500 patients in size, but for space purposes I'll set my n to only 50.
n <-
dat tibble(.rows = n) %>% # empty data frame / tibble with n rows
mutate(pt_id = factor(row_number()), # patient ids are 1-n
pt_trt_prob = runif(n,0,1), # randomly generate a treatment probability for each patient so we can see patterns
months_followup = round(runif(n, 0, 20)), # randomly generate length of time in the study
death = rbinom(n, 1, .5)) %>% # death randomly occurs at any time point. Obviously an unrealistic assumption. :)
group_by(pt_id) %>% # group by patient so we can do more data manipulation
complete(months_followup = full_seq(0:max(months_followup), 1)) %>% # add i^n all the months patients are in the study
fill(pt_trt_prob, .direction = "up") %>% # fill in the treatment probability I made earlier so I can use this to add treatment for every time point
ungroup() %>% # no longer need patients grouped
mutate(trt = factor(rbinom(row_number(), 1, pt_trt_prob^2)), # fill in treatment for everyone based on their treatment probability
death = replace_na(death, 0)) %>% # also fill in death
select(pt_id, months_followup, trt, death) # remove leftover columns from data generation
Let’s look at the data. It’s in “long” format with patient ID’s repeating for each month_followup
they were in my (fake) study. At every month, we know whether or not they were on the treatment, and whether they died at that time point. This first example does not contain any missingness.
::kable(head(dat)) knitr
pt_id | months_followup | trt | death |
---|---|---|---|
1 | 0 | 1 | 0 |
1 | 1 | 1 | 0 |
1 | 2 | 1 | 0 |
1 | 3 | 1 | 0 |
1 | 4 | 1 | 0 |
1 | 5 | 1 | 0 |
We can now plot our data in a very basic way. We really only need to specify in the aes
thetics that the x
-axis is time, the y
-axis is subject IDs, the col
or should correspond to treatment, and our lines should be group
ed together by subject. Don’t forget that last one! Then we can say we want geom_line
to make a base timeline for each subject, and the points on the timeline should be squares (shape = 15
in the geom_point
mapping function).
Et voilà! A cute patient treatment timeline.
%>%
dat ggplot(aes(x = months_followup, y = pt_id, group = pt_id, col = trt)) +
geom_line() +
geom_point(shape = 15)
Okay, it’s not that cute. But that little bit of code is really the core of the plot! If you want to add markers of death (or any other event) and rearrange by length of follow up time it just takes a little bit of extra data manipulation and one additional line of ggplot2
code.
%>%
dat group_by(pt_id) %>%
# make a new column with all patients last follow up visit
mutate(last_month_followup = max(months_followup),
# new variable for month that patients died, if they died
month_death = case_when(death == 1 ~ last_month_followup, TRUE ~ NA_real_)) %>%
# reorder pt id by last month of follow up (highest to lowest)
# without fct_rev, chart is arranged in opposite direction (lowest to highest)
ungroup() %>%
mutate(pt_id = fct_rev(fct_reorder(pt_id, last_month_followup))) %>%
ggplot(aes(x = months_followup, y = pt_id, group = pt_id, col = trt)) +
geom_line() +
geom_point(shape = 15) +
# add in a new mapping layer of points that correspond to death
geom_point(aes(x = month_death, y = pt_id), col = "black", shape = 4)
Warning: Removed 515 rows containing missing values (geom_point).
Don’t worry about the warning for removing missing values – it’s because we have NA
s at most months for our month_death
, so geom_point
doesn’t know where to put an ‘x’ marker.
Finally, let’s work on making the plot a bit nicer to look at:
%>%
dat group_by(pt_id) %>%
mutate(last_month_followup = max(months_followup),
month_death = case_when(death == 1 ~ last_month_followup, TRUE ~ NA_real_)) %>%
ungroup() %>%
mutate(pt_id = fct_rev(fct_reorder(pt_id, last_month_followup)),
# make the treatment variable labels nicer
trt = factor(trt, levels=0:1, labels=c("No","Yes"))) %>%
ggplot(aes(x = months_followup, y = pt_id, group = pt_id, col = trt)) +
geom_line() +
geom_point(shape = 15) +
geom_point(aes(x = month_death, y = pt_id), col = "black", shape = 4) +
theme_bw() +
labs(x = "Months of Follow-Up", y = "Patient ID", col = "Treatment",
title = "Patient Treatment Timeline",
subtitle = "x indicates month of patient death") +
# edit legend box and make patient ids small
theme(axis.text.y = element_text(size=6),
legend.position = c(.6, .9), legend.direction = "horizontal",
legend.background = element_rect(linetype="solid", colour ="black")) +
# remove extra space around timeline
scale_x_continuous(expand=c(0.01,0.01)) +
# set the color of the lines and points
scale_color_manual(values=c("dodgerblue","firebrick1"))
Warning: Removed 515 rows containing missing values (geom_point).
I have never made one of these plots for publication, so I don’t mind that the subtitle contains legend information. If you wanted to have the month_death
markers in a legend, you could change:
geom_point(aes(x = month_death, y = pt_id), col = "black", shape = 4)
to:
geom_point(aes(x = month_death, y = pt_id, shape = month_death), col = "black")
Example 2: Continuous Covariate with Missingness
We can follow the very same steps for making a timeline for a continuous variable.
Again, here is the data generation code so you can make these plots yourself:
<-
dat_cc tibble(.rows = n) %>%
mutate(pt_id = row_number(),
months_followup = round(runif(n, 0, 12)), # random months of follow up
pt_cov_mean = runif(n, 80, 150), # random mean of a patient's covariate measures
death = rbinom(n, 1, 0.5)) %>% # random death
group_by(pt_id) %>%
complete(months_followup = full_seq(0:max(months_followup), 1)) %>%
fill(pt_cov_mean, death, .direction = "up") %>%
mutate(last_month_followup = max(months_followup),
death_date = case_when(death == 1 ~ last_month_followup, TRUE ~ NA_real_)) %>%
ungroup() %>%
mutate(cov = rnorm(row_number(), pt_cov_mean, 10)) # everyone's covariates are close to their original mean (use to see patterns later)
This time I am incorporating missingness at a patient-level on my simulated covariate measurements to show how we can use the plots to look at potential patterns in missingness.
<- dat_cc %>%
dat_cc_samp mutate(idx = row_number()) %>%
sample_frac(.4, weight = pt_cov_mean^3) %>% # sample 40% of data, with weights for the sample determined by the patient's mean covariate. This would mean patients with a higher mean covariate measure are more likely to have missing data.
pull(idx)
<- dat_cc %>%
dat_cc_miss mutate(cov = case_when(row_number() %in% dat_cc_samp ~ NA_real_,
TRUE ~ cov)) %>%
select(pt_id, months_followup, cov, death)
Looking again at the data we’ll be using:
::kable(head(dat_cc_miss)) knitr
pt_id | months_followup | cov | death |
---|---|---|---|
1 | 0 | 91.69544 | 0 |
2 | 0 | NA | 1 |
2 | 1 | NA | 1 |
2 | 2 | 156.57317 | 1 |
2 | 3 | NA | 1 |
2 | 4 | 172.23156 | 1 |
We can plot the data and see this missingness with gray timeline points and lines:
%>%
dat_cc_miss group_by(pt_id) %>%
mutate(last_month_followup = max(months_followup),
month_death = case_when(death == 1 ~ last_month_followup, TRUE ~ NA_real_)) %>%
ungroup() %>%
mutate(pt_id = fct_rev(fct_reorder(factor(pt_id), last_month_followup))) %>%
ggplot(aes(x = months_followup, y = pt_id,
group = pt_id, col = cov)) +
geom_line() +
geom_point(shape = 15) +
geom_point(aes(x = month_death, y = pt_id), shape=4, col="black") +
theme_bw() +
labs(x = "Months of Follow-Up", y = "Patient ID", col = "Covariate",
title = "Patient Timelines: Continuous Covariate",
subtitle = "x indicates month of patient death, gray indicates missing covariate") +
theme(axis.text.y = element_text(size=6),
legend.position = c(.7, .9),
legend.direction = "horizontal",
legend.background = element_rect(linetype="solid", colour ="black")) +
scale_x_continuous(expand=c(0.01,0.01)) +
scale_color_gradient(low="dodgerblue", high="firebrick1", na.value = "lightgray",
breaks=c(90, 120, 150, 180))
Warning: Removed 143 rows containing missing values (geom_point).
Alternatively, we could rearrange our timelines by the number of measures we have for each patient’s cov
ariate of interest by making a new column containing the sum(!is.na(cov))
and rearranging our pt_id
by that column.
%>%
dat_cc_miss group_by(pt_id) %>%
mutate(last_month_followup = max(months_followup),
# a column containing how many measures we have for each patient
n_measures = sum(!is.na(cov))) %>%
ungroup() %>%
mutate(
# reorder IDs by number of measures we have for each patient
pt_id = fct_rev(fct_reorder(factor(pt_id), n_measures)),
month_death = case_when(death == 1 ~ last_month_followup, TRUE ~ NA_real_)) %>%
ggplot(aes(x = months_followup, y = pt_id,
group = pt_id, col = cov)) +
geom_line() +
geom_point(shape = 15) +
geom_point(aes(x = month_death, y = pt_id), shape=4, col="black") +
theme_bw() +
labs(x = "Months of Follow-Up", y = "Patient ID", col = "Covariate",
title = "Patient Timelines: Continuous Covariate",
subtitle = "x indicates month of patient death, gray indicates missing covariate") +
theme(axis.text.y = element_text(size=6),
legend.position = c(.7, .9),
legend.direction = "horizontal",
legend.background = element_rect(linetype="solid", colour ="black")) +
scale_x_continuous(expand=c(0.01,0.01)) +
scale_color_gradient(low="dodgerblue",high="firebrick1",na.value = "lightgray",
breaks=c(90, 120, 150, 180))
Warning: Removed 143 rows containing missing values (geom_point).
From this plot we can start to see that patients with more missing measurements have higher covariate measurements. This becomes clearer if we rearrange patients by their proportion of non-missing measurements.
%>%
dat_cc_miss group_by(pt_id) %>%
mutate(last_month_followup = max(months_followup),
# the proportion is the total number of measures divided by the total months of followup
prop_measures = sum(!is.na(cov))/last_month_followup) %>%
ungroup() %>%
# reorder IDs by proportion of measures we have for each patient
mutate(pt_id = fct_rev(fct_reorder(factor(pt_id), prop_measures)),
month_death = case_when(death == 1 ~ last_month_followup, TRUE ~ NA_real_)) %>%
ggplot(aes(x = months_followup, y = pt_id,
group = pt_id, col = cov)) +
geom_line() +
geom_point(shape = 15) +
geom_point(aes(x = month_death, y = pt_id), shape=4, col="black") +
theme_bw() +
labs(x = "Months of Follow-Up", y = "Patient ID", col = "Covariate",
title = "Patient Timelines: Continuous Covariate",
subtitle = "x indicates month of patient death, gray indicates missing covariate") +
theme(axis.text.y = element_text(size=6), # move legend to the side by removing legend location
legend.background = element_rect(linetype="solid", colour ="black")) +
scale_x_continuous(expand=c(0.01,0.01)) +
scale_color_gradient(low="dodgerblue",high="firebrick1",na.value = "lightgray",
breaks=c(90, 120, 150, 180))
Warning: Removed 143 rows containing missing values (geom_point).
If this were my real data, I would follow up with my research collaborators for more information on the missingness mechanism.
Please let me know if you have questions or a suggestion for a data set. Happy treatment timeline plotting!
Session info
sessionInfo()
R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Catalina 10.15.7
Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] forcats_0.5.1 stringr_1.4.1 dplyr_1.0.9 purrr_0.3.4
[5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8 ggplot2_3.3.6
[9] tidyverse_1.3.1
loaded via a namespace (and not attached): [1] tidyselect_1.1.2 xfun_0.32 haven_2.5.0 colorspace_2.0-3 [5] vctrs_0.4.1 generics_0.1.3 htmltools_0.5.2 yaml_2.3.5
[9] utf8_1.2.2 rlang_1.0.4 pillar_1.8.1 glue_1.6.2
[13] withr_2.5.0 DBI_1.1.2 dbplyr_2.1.1 modelr_0.1.8
[17] readxl_1.4.0 lifecycle_1.0.1 munsell_0.5.0 gtable_0.3.0
[21] cellranger_1.1.0 rvest_1.0.2 htmlwidgets_1.5.4 evaluate_0.15
[25] labeling_0.4.2 knitr_1.38 tzdb_0.3.0 fastmap_1.1.0
[29] cabinets_0.6.0 fansi_1.0.3 highr_0.9 broom_0.8.0
[33] backports_1.4.1 scales_1.2.1 jsonlite_1.8.0 farver_2.1.1
[37] fs_1.5.2 gridExtra_2.3 hms_1.1.1 digest_0.6.29
[41] stringi_1.7.8 grid_4.1.3 cli_3.3.0 tools_4.1.3
[45] magrittr_2.0.3 crayon_1.5.1 pkgconfig_2.0.3 ellipsis_0.3.2
[49] xml2_1.3.3 reprex_2.0.1 lubridate_1.8.0 assertthat_0.2.1 [53] rmarkdown_2.13 httr_1.4.2 rstudioapi_0.13 R6_2.5.1
[57] compiler_4.1.3