TL;DR
If you’re ever felt limited by correlogram packages in
R
, this post will show you how to write your own function to tidy the many correlations into aggplot2
-friendly form for plotting.By the end, you will be able to run one function to get a tidied data frame of correlations:
formatted_cors(mtcars) %>% head() %>% kable()
measure1 | measure2 | r | n | p | sig_p | p_if_sig | r_if_sig |
---|---|---|---|---|---|---|---|
mpg | mpg | 1.0000000 | 32 | NA | NA | NA | NA |
mpg | cyl | -0.8521620 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8521620 |
mpg | disp | -0.8475514 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8475514 |
mpg | hp | -0.7761684 | 32 | 2.00e-07 | TRUE | 2.00e-07 | -0.7761684 |
mpg | drat | 0.6811719 | 32 | 1.78e-05 | TRUE | 1.78e-05 | 0.6811719 |
mpg | wt | -0.8676594 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8676594 |
- You can then run
ggplot2
code on this data to make your own correlation heat maps.
- If you just want the code, skip to the end.
Less-customizable options
I really appreciate some of the packages and functions that allow me to make correlation plots super quickly using R
. Here are a few examples:
corrplot::corrplot(cor(mtcars))
corrgram::corrgram(mtcars)
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
ggcorrplot::ggcorrplot(cor(mtcars))
All of these are nice, but none of them are ultimately as customizable as I need them to be. I’ll next show how you can bypass using someone else’s function constraints to prepare correlations in your data in a ggplot2
-friendly format.
Getting the correlations
We could use the base R
function cor()
to get our correlations, but I do not like the defaults for missing data. Instead, I use Frank Harrell’s Hmisc::rcorr()
function for two reasons:
it drops missing pairs as the default
it returns p-values, so you only need one function to get both the correlation coefficient and matching p-value
Let’s load the libraries we’ll need for this, which are knitr
for showing tables using kable
, and tidyverse
(we’ll specifically use tidyr
, dplyr
, ggplot2
, tibble
and purrr
).
library(knitr)
library(tidyverse, warn.conflict=F)
First, let’s look at our output from our correlation function we’ll use, Hmisc::rcorr()
. It requires the input to be a matrix, and outputs a list of three matrices.
mtcars_cor <- Hmisc::rcorr(as.matrix(mtcars))
These three matrices include the correlation coefficient (default is Pearson’s), r
, the p-value, P
, and the number of observations used for each correlation, n
. Let’s turn each matrix into a data frame
and look at the top six rows with head
and kable
.
The correlation coefficients, r
:
data.frame(mtcars_cor$r) %>% head() %>% kable()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 1.0000000 | -0.8521620 | -0.8475514 | -0.7761684 | 0.6811719 | -0.8676594 | 0.4186840 | 0.6640389 | 0.5998324 | 0.4802848 | -0.5509251 |
cyl | -0.8521620 | 1.0000000 | 0.9020329 | 0.8324475 | -0.6999381 | 0.7824958 | -0.5912421 | -0.8108118 | -0.5226070 | -0.4926866 | 0.5269883 |
disp | -0.8475514 | 0.9020329 | 1.0000000 | 0.7909486 | -0.7102139 | 0.8879799 | -0.4336979 | -0.7104159 | -0.5912270 | -0.5555692 | 0.3949769 |
hp | -0.7761684 | 0.8324475 | 0.7909486 | 1.0000000 | -0.4487591 | 0.6587479 | -0.7082234 | -0.7230967 | -0.2432043 | -0.1257043 | 0.7498125 |
drat | 0.6811719 | -0.6999381 | -0.7102139 | -0.4487591 | 1.0000000 | -0.7124406 | 0.0912048 | 0.4402785 | 0.7127111 | 0.6996101 | -0.0907898 |
wt | -0.8676594 | 0.7824958 | 0.8879799 | 0.6587479 | -0.7124406 | 1.0000000 | -0.1747159 | -0.5549157 | -0.6924953 | -0.5832870 | 0.4276059 |
The p-values, P
:
data.frame(mtcars_cor$P) %>% head() %>% kable()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
mpg | NA | 0.0e+00 | 0.0e+00 | 0.0000002 | 0.0000178 | 0.00e+00 | 0.0170820 | 0.0000342 | 0.0002850 | 0.0054009 | 0.0010844 |
cyl | 0.00e+00 | NA | 0.0e+00 | 0.0000000 | 0.0000082 | 1.00e-07 | 0.0003661 | 0.0000000 | 0.0021512 | 0.0041733 | 0.0019423 |
disp | 0.00e+00 | 0.0e+00 | NA | 0.0000001 | 0.0000053 | 0.00e+00 | 0.0131440 | 0.0000052 | 0.0003662 | 0.0009636 | 0.0252679 |
hp | 2.00e-07 | 0.0e+00 | 1.0e-07 | NA | 0.0099888 | 4.15e-05 | 0.0000058 | 0.0000029 | 0.1798309 | 0.4930119 | 0.0000008 |
drat | 1.78e-05 | 8.2e-06 | 5.3e-06 | 0.0099888 | NA | 4.80e-06 | 0.6195826 | 0.0116755 | 0.0000047 | 0.0000084 | 0.6211834 |
wt | 0.00e+00 | 1.0e-07 | 0.0e+00 | 0.0000415 | 0.0000048 | NA | 0.3388683 | 0.0009798 | 0.0000113 | 0.0004587 | 0.0146386 |
The number of observations, n
. There are no missing data in the mtcars
data set so there are 32 pairs used for all correlations.
data.frame(mtcars_cor$n) %>% head(n=3) %>% kable()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
cyl | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
disp | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
Next we can write a function that formats a data frame
correctly for Hmisc::rcorr()
and then turns each of the three elements of the list (r
,n
and P
)
cors <- function(df) {
M <- Hmisc::rcorr(as.matrix(df))
# turn all three matrices (r, n, and P into a data frame)
Mdf <- map(M, ~data.frame(.x))
# return the three data frames in a list
return(Mdf)
}
Nothing too crazy happened in this function. Now we just have a list of three data frames. We can look at the the first element of our list using first()
, which shows us the correlations between all our variables:
cors(mtcars) %>% first() %>% head() %>% kable()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 1.0000000 | -0.8521620 | -0.8475514 | -0.7761684 | 0.6811719 | -0.8676594 | 0.4186840 | 0.6640389 | 0.5998324 | 0.4802848 | -0.5509251 |
cyl | -0.8521620 | 1.0000000 | 0.9020329 | 0.8324475 | -0.6999381 | 0.7824958 | -0.5912421 | -0.8108118 | -0.5226070 | -0.4926866 | 0.5269883 |
disp | -0.8475514 | 0.9020329 | 1.0000000 | 0.7909486 | -0.7102139 | 0.8879799 | -0.4336979 | -0.7104159 | -0.5912270 | -0.5555692 | 0.3949769 |
hp | -0.7761684 | 0.8324475 | 0.7909486 | 1.0000000 | -0.4487591 | 0.6587479 | -0.7082234 | -0.7230967 | -0.2432043 | -0.1257043 | 0.7498125 |
drat | 0.6811719 | -0.6999381 | -0.7102139 | -0.4487591 | 1.0000000 | -0.7124406 | 0.0912048 | 0.4402785 | 0.7127111 | 0.6996101 | -0.0907898 |
wt | -0.8676594 | 0.7824958 | 0.8879799 | 0.6587479 | -0.7124406 | 1.0000000 | -0.1747159 | -0.5549157 | -0.6924953 | -0.5832870 | 0.4276059 |
Prep the correlations for ggplot2
The next step is to get the data ready for plotting with ggplot2
. We can keep the data in a list for now and use the map()
function from purrr
.
First, we need to move the rownames to their own column using tibble::rownames_to_column()
. The output of that looks like:
cors(mtcars) %>%
map(~rownames_to_column(.x, var="measure1")) %>%
# look at the first element of the list (r)
first() %>%
head() %>%
kable()
measure1 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 1.0000000 | -0.8521620 | -0.8475514 | -0.7761684 | 0.6811719 | -0.8676594 | 0.4186840 | 0.6640389 | 0.5998324 | 0.4802848 | -0.5509251 |
cyl | -0.8521620 | 1.0000000 | 0.9020329 | 0.8324475 | -0.6999381 | 0.7824958 | -0.5912421 | -0.8108118 | -0.5226070 | -0.4926866 | 0.5269883 |
disp | -0.8475514 | 0.9020329 | 1.0000000 | 0.7909486 | -0.7102139 | 0.8879799 | -0.4336979 | -0.7104159 | -0.5912270 | -0.5555692 | 0.3949769 |
hp | -0.7761684 | 0.8324475 | 0.7909486 | 1.0000000 | -0.4487591 | 0.6587479 | -0.7082234 | -0.7230967 | -0.2432043 | -0.1257043 | 0.7498125 |
drat | 0.6811719 | -0.6999381 | -0.7102139 | -0.4487591 | 1.0000000 | -0.7124406 | 0.0912048 | 0.4402785 | 0.7127111 | 0.6996101 | -0.0907898 |
wt | -0.8676594 | 0.7824958 | 0.8879799 | 0.6587479 | -0.7124406 | 1.0000000 | -0.1747159 | -0.5549157 | -0.6924953 | -0.5832870 | 0.4276059 |
Next, we can turn move of the columns to a single column called measure2
using tidyr::pivot_longer()
cors(mtcars) %>%
map(~rownames_to_column(.x, var="measure1")) %>%
# format each data set (r,P,n) long
map(~pivot_longer(.x, -measure1, "measure2")) %>%
# look at the first element of the list (r)
first() %>%
head() %>%
kable()
measure1 | measure2 | value |
---|---|---|
mpg | mpg | 1.0000000 |
mpg | cyl | -0.8521620 |
mpg | disp | -0.8475514 |
mpg | hp | -0.7761684 |
mpg | drat | 0.6811719 |
mpg | wt | -0.8676594 |
Now, we’re ready to unlist our data by using bind_rows()
. This will turn our correlations into a very long data frame with all the rows from r
, then n
, then P
.
cors(mtcars) %>%
map(~rownames_to_column(.x, var="measure1")) %>%
# format each data set (r,P,n) long
map(~pivot_longer(.x, -measure1, "measure2")) %>%
# merge our three list elements by binding the rows
bind_rows(.id = "id") %>%
head() %>%
kable()
id | measure1 | measure2 | value |
---|---|---|---|
r | mpg | mpg | 1.0000000 |
r | mpg | cyl | -0.8521620 |
r | mpg | disp | -0.8475514 |
r | mpg | hp | -0.7761684 |
r | mpg | drat | 0.6811719 |
r | mpg | wt | -0.8676594 |
For ggplot2
, we’ll need to have r
, n
, and p
as their own column. We can use pivot_longer()
to do this.
cors(mtcars) %>%
map(~rownames_to_column(.x, var="measure1")) %>%
# format each data set (r,P,n) long
map(~pivot_longer(.x, -measure1, "measure2")) %>%
# merge our three list elements by binding the rows
bind_rows(.id = "id") %>%
pivot_wider(names_from = id, values_from = value) %>%
head() %>%
kable()
measure1 | measure2 | r | n | P |
---|---|---|---|---|
mpg | mpg | 1.0000000 | 32 | NA |
mpg | cyl | -0.8521620 | 32 | 0.00e+00 |
mpg | disp | -0.8475514 | 32 | 0.00e+00 |
mpg | hp | -0.7761684 | 32 | 2.00e-07 |
mpg | drat | 0.6811719 | 32 | 1.78e-05 |
mpg | wt | -0.8676594 | 32 | 0.00e+00 |
Finally, we can add a few columns that will potentially be useful later for making our correlation plots more informative. Let’s add columns that tell us whether the p-value was less than 0.05, and if so, give us back 1) the p-value and 2) the correlation coefficient, in case we want to label our plot with these values.
cors(mtcars) %>%
map(~rownames_to_column(.x, var="measure1")) %>%
# format each data set (r,P,n) long
map(~pivot_longer(.x, -measure1, "measure2")) %>%
# merge our three list elements by binding the rows
bind_rows(.id = "id") %>%
pivot_wider(names_from = id, values_from = value) %>%
# change so everything is lower case
rename(p = P) %>%
mutate(sig_p = ifelse(p < .05, T, F),
p_if_sig = ifelse(sig_p, p, NA),
r_if_sig = ifelse(sig_p, r, NA)) %>%
head() %>%
kable()
measure1 | measure2 | r | n | p | sig_p | p_if_sig | r_if_sig |
---|---|---|---|---|---|---|---|
mpg | mpg | 1.0000000 | 32 | NA | NA | NA | NA |
mpg | cyl | -0.8521620 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8521620 |
mpg | disp | -0.8475514 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8475514 |
mpg | hp | -0.7761684 | 32 | 2.00e-07 | TRUE | 2.00e-07 | -0.7761684 |
mpg | drat | 0.6811719 | 32 | 1.78e-05 | TRUE | 1.78e-05 | 0.6811719 |
mpg | wt | -0.8676594 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8676594 |
This seems like everything I think I’ll ever ever want to plot. Of course you could add more. At this point I turned my formatted correlations into a function:
formatted_cors <- function(df){
cors(df) %>%
map(~rownames_to_column(.x, var="measure1")) %>%
map(~pivot_longer(.x, -measure1, "measure2")) %>%
bind_rows(.id = "id") %>%
pivot_wider(names_from = id, values_from = value) %>%
rename(p = P) %>%
mutate(sig_p = ifelse(p < .05, T, F),
p_if_sig = ifelse(sig_p, p, NA),
r_if_sig = ifelse(sig_p, r, NA))
}
We can test the function works as expected:
formatted_cors(mtcars) %>% head() %>% kable()
measure1 | measure2 | r | n | p | sig_p | p_if_sig | r_if_sig |
---|---|---|---|---|---|---|---|
mpg | mpg | 1.0000000 | 32 | NA | NA | NA | NA |
mpg | cyl | -0.8521620 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8521620 |
mpg | disp | -0.8475514 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8475514 |
mpg | hp | -0.7761684 | 32 | 2.00e-07 | TRUE | 2.00e-07 | -0.7761684 |
mpg | drat | 0.6811719 | 32 | 1.78e-05 | TRUE | 1.78e-05 | 0.6811719 |
mpg | wt | -0.8676594 | 32 | 0.00e+00 | TRUE | 0.00e+00 | -0.8676594 |
Plotting
We’re finally ready to plot our correlation heat maps in ggplot2
.
The simplest form of this plot only requires us to specify measure1
and measure2
on the x
and y
-axis, respectively. Then we can map the correlation r
to the fill
aes
thetic, and add a tile as the geom
etry.
formatted_cors(mtcars) %>%
ggplot(aes(x = measure1, y = measure2, fill = r)) +
geom_tile()
We can make some minor aesthetic changes, such as the fill coloring scale, titles, and font family.
formatted_cors(mtcars) %>%
ggplot(aes(x = measure1, y = measure2, fill = r)) +
geom_tile() +
labs(x = NULL, y = NULL, fill = "Pearson's\nCorrelation", title="Correlations in Mtcars") +
# map a red, white and blue color scale to correspond to -1:1 sequential gradient
scale_fill_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1)) +
theme_classic() +
# remove excess space on x and y axes
scale_x_discrete(expand=c(0,0)) +
scale_y_discrete(expand=c(0,0)) +
# change global font to roboto
theme(text=element_text(family="Roboto"))
We can add the correlations for extra information. For this particular plot, I only added significant (p-value less than 0.05) correlations, using the column r_if_sig
that outputs from formatted_cors()
.
formatted_cors(mtcars) %>%
ggplot(aes(measure1, measure2, fill=r, label=round(r_if_sig,2))) +
geom_tile() +
labs(x = NULL, y = NULL, fill = "Pearson's\nCorrelation", title="Correlations in Mtcars",
subtitle="Only significant Pearson's correlation coefficients shown") +
scale_fill_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1)) +
geom_text() +
theme_classic() +
scale_x_discrete(expand=c(0,0)) +
scale_y_discrete(expand=c(0,0)) +
theme(text=element_text(family="Roboto"))
Another version of this could involve squares with different sizes to denote strength of correlation using geom_point
with shape
set to a value from these available geom_shape
s. Make sure you take the absolute value of the correlation so that strong negative correlations can also be denoted larger.
formatted_cors(mtcars) %>%
ggplot(aes(measure1, measure2, col=r)) + ## to get the rect filled
geom_tile(col="black", fill="white") +
geom_point(aes(size = abs(r)), shape=15) +
labs(x = NULL, y = NULL, col = "Pearson's\nCorrelation", title="Correlations in Mtcars") +
theme_classic()+
scale_color_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1)) +
scale_x_discrete(expand=c(0,0)) +
scale_y_discrete(expand=c(0,0)) +
theme(text=element_text(family="Roboto")) +
scale_size(range=c(1,11), guide=NULL)
Please feel free to reach out with questions or suggestions. Thank you to Elena Leib for spotting a minor bug in a previous version of this post!
Just the code
cors <- function(df) {
M <- Hmisc::rcorr(as.matrix(df))
Mdf <- map(M, ~data.frame(.x))
return(Mdf)
}
formatted_cors <- function(df){
cors(df) %>%
map(~rownames_to_column(.x, var="measure1")) %>%
map(~pivot_longer(.x, -measure1, "measure2")) %>%
bind_rows(.id = "id") %>%
pivot_wider(names_from = id, values_from = value) %>%
rename(p = P) %>%
mutate(sig_p = ifelse(p < .05, T, F),
p_if_sig = ifelse(sig_p, p, NA),
r_if_sig = ifelse(sig_p, r, NA))
}
formatted_cors(mtcars) %>%
ggplot(aes(measure1, measure2, fill=r, label=round(r_if_sig,2))) +
geom_tile() +
labs(x = NULL, y = NULL, fill = "Pearson's\nCorrelation", title="Correlations in Mtcars",
subtitle="Only significant Pearson's correlation coefficients shown") +
scale_fill_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1)) +
geom_text() +
theme_classic() +
scale_x_discrete(expand=c(0,0)) +
scale_y_discrete(expand=c(0,0)) +
theme(text=element_text(family="Roboto"))
sessionInfo()
R version 3.6.3 (2020-02-29) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS Catalina 10.15.7
Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.30 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
[5] purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.4
[9] ggplot2_3.3.2 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 fs_1.4.1 lubridate_1.7.9
[4] RColorBrewer_1.1-2 httr_1.4.2 tools_3.6.3
[7] backports_1.1.8 R6_2.5.0 KernSmooth_2.23-16
[10] rpart_4.1-15 Hmisc_4.4-0 DBI_1.1.0
[13] colorspace_1.4-1 nnet_7.3-12 withr_2.2.0
[16] tidyselect_1.1.0 gridExtra_2.3 cabinets_0.6.0
[19] compiler_3.6.3 cli_2.2.0 rvest_0.3.5
[22] htmlTable_1.13.3 TSP_1.1-10 xml2_1.3.0
[25] labeling_0.3 bookdown_0.19 caTools_1.18.0
[28] scales_1.1.1 checkmate_2.0.0 digest_0.6.27
[31] foreign_0.8-75 rmarkdown_2.1 base64enc_0.1-3
[34] jpeg_0.1-8.1 pkgconfig_2.0.3 htmltools_0.5.0
[37] dbplyr_1.4.3 highr_0.8 htmlwidgets_1.5.1
[40] rlang_0.4.8 readxl_1.3.1 rstudioapi_0.11
[43] generics_0.1.0 farver_2.0.3 jsonlite_1.7.1
[46] gtools_3.8.2 dendextend_1.13.4 acepack_1.4.1
[49] magrittr_1.5 Formula_1.2-3 Matrix_1.2-18
[52] Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1
[55] viridis_0.5.1 lifecycle_0.2.0 stringi_1.5.3
[58] yaml_2.2.1 MASS_7.3-51.5 plyr_1.8.6
[61] gplots_3.1.1 grid_3.6.3 crayon_1.3.4
[64] lattice_0.20-38 haven_2.2.0 splines_3.6.3
[67] hms_0.5.3 pillar_1.4.6 reshape2_1.4.3
[70] codetools_0.2-16 reprex_0.3.0 glue_1.4.2
[73] gclus_1.3.2 evaluate_0.14 blogdown_0.19
[76] latticeExtra_0.6-29 data.table_1.13.2 modelr_0.1.6
[79] png_0.1-7 vctrs_0.3.5 foreach_1.5.0
[82] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
[85] xfun_0.19 broom_0.7.0 ggcorrplot_0.1.3
[88] viridisLite_0.3.0 survival_3.1-11 seriation_1.2-8
[91] corrgram_1.13 iterators_1.0.12 registry_0.5-1
[94] cluster_2.1.0 corrplot_0.84 ellipsis_0.3.1