Customizable correlation plots in R


TL;DR

  • If you’re ever felt limited by correlogram packages in R, this post will show you how to write your own function to tidy the many correlations into a ggplot2-friendly form for plotting.

  • By the end, you will be able to run one function to get a tidied data frame of correlations:

formatted_cors(mtcars) %>% head() %>% kable()
measure1 measure2 r n P sig_p p_if_sig r_if_sig
mpg mpg 1.0000000 32 NA NA NA NA
mpg cyl -0.8521620 32 0.00e+00 TRUE 0.00e+00 -0.8521620
mpg disp -0.8475514 32 0.00e+00 TRUE 0.00e+00 -0.8475514
mpg hp -0.7761684 32 2.00e-07 TRUE 2.00e-07 -0.7761684
mpg drat 0.6811719 32 1.78e-05 TRUE 1.78e-05 0.6811719
mpg wt -0.8676594 32 0.00e+00 TRUE 0.00e+00 -0.8676594
  • You can then run ggplot2 code on this data to make your own correlation heat maps.


Less-customizable options

I really appreciate some of the packages and functions that allow me to make correlation plots super quickly using R. Here are a few examples:

corrplot::corrplot(cor(mtcars))

corrgram::corrgram(mtcars)
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus

ggcorrplot::ggcorrplot(cor(mtcars))

All of these are nice, but none of them are ultimately as customizable as I need them to be. I’ll next show how you can bypass using someone else’s function constraints to prepare correlations in your data in a ggplot2-friendly format.

Getting the correlations

We could use the base R function cor() to get our correlations, but I do not like the defaults for missing data. Instead, I use Frank Harrell’s Hmisc::rcorr() function for two reasons:

  1. it drops missing pairs as the default

  2. it returns p-values, so you only need one function to get both the correlation coefficient and matching p-value

Let’s load the libraries we’ll need for this, which are knitr for showing tables using kable, and tidyverse (we’ll specifically use tidyr, dplyr, ggplot2, tibble and purrr).

library(knitr)
library(tidyverse, warn.conflict=F)

First, let’s look at our output from our correlation function we’ll use, Hmisc::rcorr(). It requires the input to be a matrix, and outputs a list of three matrices.

mtcars_cor <- Hmisc::rcorr(as.matrix(mtcars))

These three matrices include the correlation coefficient (default is Pearson’s), r, the p-value, P, and the number of observations used for each correlation, n. Let’s turn each matrix into a data frame and look at the top six rows with head and kable.

The correlation coefficients, r:

data.frame(mtcars_cor$r) %>% head() %>% kable()
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.6999381 0.7824958 -0.5912421 -0.8108118 -0.5226070 -0.4926866 0.5269883
disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7102139 0.8879799 -0.4336979 -0.7104159 -0.5912270 -0.5555692 0.3949769
hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4487591 0.6587479 -0.7082234 -0.7230967 -0.2432043 -0.1257043 0.7498125
drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.0000000 -0.7124406 0.0912048 0.4402785 0.7127111 0.6996101 -0.0907898
wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406 1.0000000 -0.1747159 -0.5549157 -0.6924953 -0.5832870 0.4276059

The p-values, P:

data.frame(mtcars_cor$P) %>% head() %>% kable()
mpg cyl disp hp drat wt qsec vs am gear carb
mpg NA 0.0e+00 0.0e+00 0.0000002 0.0000178 0.00e+00 0.0170820 0.0000342 0.0002850 0.0054009 0.0010844
cyl 0.00e+00 NA 0.0e+00 0.0000000 0.0000082 1.00e-07 0.0003661 0.0000000 0.0021512 0.0041733 0.0019423
disp 0.00e+00 0.0e+00 NA 0.0000001 0.0000053 0.00e+00 0.0131440 0.0000052 0.0003662 0.0009636 0.0252679
hp 2.00e-07 0.0e+00 1.0e-07 NA 0.0099888 4.15e-05 0.0000058 0.0000029 0.1798309 0.4930119 0.0000008
drat 1.78e-05 8.2e-06 5.3e-06 0.0099888 NA 4.80e-06 0.6195826 0.0116755 0.0000047 0.0000084 0.6211834
wt 0.00e+00 1.0e-07 0.0e+00 0.0000415 0.0000048 NA 0.3388683 0.0009798 0.0000113 0.0004587 0.0146386

The number of observations, n. There are no missing data in the mtcars data set so there are 32 pairs used for all correlations.

data.frame(mtcars_cor$n) %>% head(n=3) %>% kable()
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 32 32 32 32 32 32 32 32 32 32 32
cyl 32 32 32 32 32 32 32 32 32 32 32
disp 32 32 32 32 32 32 32 32 32 32 32

Next we can write a function that formats a data frame correctly for Hmisc::rcorr() and then turns each of the three elements of the list (r,n and P)

cors <- function(df) {
  M <- Hmisc::rcorr(as.matrix(df))
  # turn all three matrices (r, n, and P into a data frame)
  Mdf <- map(M, ~data.frame(.x))
  # return the three data frames in a list
  return(Mdf)
}

Nothing too crazy happened in this function. Now we just have a list of three data frames. We can look at the the first element of our list using first(), which shows us the correlations between all our variables:

cors(mtcars) %>% first() %>% head() %>% kable()
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.6999381 0.7824958 -0.5912421 -0.8108118 -0.5226070 -0.4926866 0.5269883
disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7102139 0.8879799 -0.4336979 -0.7104159 -0.5912270 -0.5555692 0.3949769
hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4487591 0.6587479 -0.7082234 -0.7230967 -0.2432043 -0.1257043 0.7498125
drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.0000000 -0.7124406 0.0912048 0.4402785 0.7127111 0.6996101 -0.0907898
wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406 1.0000000 -0.1747159 -0.5549157 -0.6924953 -0.5832870 0.4276059

Prep the correlations for ggplot2

The next step is to get the data ready for plotting with ggplot2. We can keep the data in a list for now and use the map() function from purrr.

First, we need to move the rownames to their own column using tibble::rownames_to_column(). The output of that looks like:

cors(mtcars) %>%
  map(~rownames_to_column(.x, var="measure1")) %>%
  # look at the first element of the list (r)
  first() %>%
  head() %>%
  kable()
measure1 mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.4186840 0.6640389 0.5998324 0.4802848 -0.5509251
cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.6999381 0.7824958 -0.5912421 -0.8108118 -0.5226070 -0.4926866 0.5269883
disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7102139 0.8879799 -0.4336979 -0.7104159 -0.5912270 -0.5555692 0.3949769
hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4487591 0.6587479 -0.7082234 -0.7230967 -0.2432043 -0.1257043 0.7498125
drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.0000000 -0.7124406 0.0912048 0.4402785 0.7127111 0.6996101 -0.0907898
wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406 1.0000000 -0.1747159 -0.5549157 -0.6924953 -0.5832870 0.4276059

Next, we can turn move of the columns to a single column called measure2 using tidyr::pivot_longer()

cors(mtcars) %>%
  map(~rownames_to_column(.x, var="measure1")) %>%
  # format each data set (r,P,n) long
  map(~pivot_longer(.x, -measure1, "measure2")) %>%
  # look at the first element of the list (r)
  first() %>%
  head() %>%
  kable()
measure1 measure2 value
mpg mpg 1.0000000
mpg cyl -0.8521620
mpg disp -0.8475514
mpg hp -0.7761684
mpg drat 0.6811719
mpg wt -0.8676594

Now, we’re ready to unlist our data by using bind_rows(). This will turn our correlations into a very long data frame with all the rows from r, then n, then P.

cors(mtcars) %>%
  map(~rownames_to_column(.x, var="measure1")) %>%
  # format each data set (r,P,n) long
  map(~pivot_longer(.x, -measure1, "measure2")) %>%
  # merge our three list elements by binding the rows
  bind_rows(.id = "id") %>%
  head() %>%
  kable()
id measure1 measure2 value
r mpg mpg 1.0000000
r mpg cyl -0.8521620
r mpg disp -0.8475514
r mpg hp -0.7761684
r mpg drat 0.6811719
r mpg wt -0.8676594

For ggplot2, we’ll need to have r, n, and p as their own column. We can use pivot_longer() to do this.

cors(mtcars) %>%
  map(~rownames_to_column(.x, var="measure1")) %>%
  # format each data set (r,P,n) long
  map(~pivot_longer(.x, -measure1, "measure2")) %>%
  # merge our three list elements by binding the rows
  bind_rows(.id = "id") %>%
  pivot_wider(names_from = id, values_from = value) %>%
  head() %>%
  kable()
measure1 measure2 r n P
mpg mpg 1.0000000 32 NA
mpg cyl -0.8521620 32 0.00e+00
mpg disp -0.8475514 32 0.00e+00
mpg hp -0.7761684 32 2.00e-07
mpg drat 0.6811719 32 1.78e-05
mpg wt -0.8676594 32 0.00e+00

Finally, we can add a few columns that will potentially be useful later for making our correlation plots more informative. Let’s add columns that tell us whether the p-value was less than 0.05, and if so, give us back 1) the p-value and 2) the correlation coefficient, in case we want to label our plot with these values.

cors(mtcars) %>%
  map(~rownames_to_column(.x, var="measure1")) %>%
  # format each data set (r,P,n) long
  map(~pivot_longer(.x, -measure1, "measure2")) %>%
  # merge our three list elements by binding the rows
  bind_rows(.id = "id") %>%
  pivot_wider(names_from = id, values_from = value) %>%
  mutate(sig_p = ifelse(P < .05, T, F),
           p_if_sig = ifelse(P <.05, P, NA),
           r_if_sig = ifelse(r <.05, r, NA)) %>%
  head() %>%
  kable()
measure1 measure2 r n P sig_p p_if_sig r_if_sig
mpg mpg 1.0000000 32 NA NA NA NA
mpg cyl -0.8521620 32 0.00e+00 TRUE 0.00e+00 -0.8521620
mpg disp -0.8475514 32 0.00e+00 TRUE 0.00e+00 -0.8475514
mpg hp -0.7761684 32 2.00e-07 TRUE 2.00e-07 -0.7761684
mpg drat 0.6811719 32 1.78e-05 TRUE 1.78e-05 NA
mpg wt -0.8676594 32 0.00e+00 TRUE 0.00e+00 -0.8676594

This seems like everything I think I’ll ever ever want to plot. Of course you could add more. At this point I turned my formatted correlations into a function:

formatted_cors <- function(df){
  cors(df) %>%
    map(~rownames_to_column(.x, var="measure1")) %>%
    map(~pivot_longer(.x, -measure1, "measure2")) %>%
    bind_rows(.id = "id") %>%
    pivot_wider(names_from = id, values_from = value) %>%
    mutate(sig_p = ifelse(P < .05, T, F),
           p_if_sig = ifelse(P <.05, P, NA),
           r_if_sig = ifelse(P <.05, r, NA))
}

We can test the function works as expected:

formatted_cors(mtcars) %>% head() %>% kable()
measure1 measure2 r n P sig_p p_if_sig r_if_sig
mpg mpg 1.0000000 32 NA NA NA NA
mpg cyl -0.8521620 32 0.00e+00 TRUE 0.00e+00 -0.8521620
mpg disp -0.8475514 32 0.00e+00 TRUE 0.00e+00 -0.8475514
mpg hp -0.7761684 32 2.00e-07 TRUE 2.00e-07 -0.7761684
mpg drat 0.6811719 32 1.78e-05 TRUE 1.78e-05 0.6811719
mpg wt -0.8676594 32 0.00e+00 TRUE 0.00e+00 -0.8676594

Plotting

We’re finally ready to plot our correlation heat maps in ggplot2.

The simplest form of this plot only requires us to specify measure1 and measure2 on the x and y-axis, respectively. Then we can map the correlation r to the fill aesthetic, and add a tile as the geometry.

formatted_cors(mtcars) %>%
  ggplot(aes(x = measure1, y = measure2, fill = r)) +
  geom_tile()

We can make some minor aesthetic changes, such as the fill coloring scale, titles, and font family.

formatted_cors(mtcars) %>%
  ggplot(aes(x = measure1, y = measure2, fill = r)) +
  geom_tile() +
  labs(x = NULL, y = NULL, fill = "Pearson's\nCorrelation", title="Correlations in Mtcars") +
  # map a red, white and blue color scale to correspond to -1:1 sequential gradient
  scale_fill_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1)) +
  theme_classic() +
  # remove excess space on x and y axes
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  # change global font to roboto
  theme(text=element_text(family="Roboto"))

We can add the correlations for extra information. For this particular plot, I only added significant (p-value less than 0.05) correlations, using the column r_if_sig that outputs from formatted_cors().

formatted_cors(mtcars) %>%
  ggplot(aes(measure1, measure2, fill=r, label=round(r_if_sig,2))) +
  geom_tile() +
  labs(x = NULL, y = NULL, fill = "Pearson's\nCorrelation", title="Correlations in Mtcars",
       subtitle="Only significant Pearson's correlation coefficients shown") +
  scale_fill_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1)) +
  geom_text() +
  theme_classic() +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  theme(text=element_text(family="Roboto"))

Another version of this could involve squares with different sizes to denote strength of correlation using geom_point with shape set to a value from these available geom_shapes. Make sure you take the absolute value of the correlation so that strong negative correlations can also be denoted larger.

formatted_cors(mtcars) %>%
  ggplot(aes(measure1, measure2, col=r)) + ## to get the rect filled
  geom_tile(col="black", fill="white") +
  geom_point(aes(size = abs(r)), shape=15) +
  labs(x = NULL, y = NULL, col = "Pearson's\nCorrelation", title="Correlations in Mtcars") +
  theme_classic()+
  scale_color_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1))  +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  theme(text=element_text(family="Roboto")) +
  scale_size(range=c(1,11), guide=NULL) 

Just the code

cors <- function(df) {
  M <- Hmisc::rcorr(as.matrix(df))
  Mdf <- map(M, ~data.frame(.x))
  return(Mdf)
}

formatted_cors <- function(df){
  cors(df) %>%
    map(~rownames_to_column(.x, var="measure1")) %>%
    map(~pivot_longer(.x, -measure1, "measure2")) %>%
    bind_rows(.id = "id") %>%
    pivot_wider(names_from = id, values_from = value) %>%
    mutate(sig_p = ifelse(P < .05, T, F),
           p_if_sig = ifelse(P <.05, P, NA),
           r_if_sig = ifelse(P <.05, r, NA))
}

formatted_cors(mtcars) %>%
  ggplot(aes(measure1, measure2, fill=r, label=round(r_if_sig,2))) +
  geom_tile() +
  labs(x = NULL, y = NULL, fill = "Pearson's\nCorrelation", title="Correlations in Mtcars",
       subtitle="Only significant Pearson's correlation coefficients shown") +
  scale_fill_gradient2(mid="#FBFEF9",low="#0C6291",high="#A63446", limits=c(-1,1)) +
  geom_text() +
  theme_classic() +
  scale_x_discrete(expand=c(0,0)) +
  scale_y_discrete(expand=c(0,0)) +
  theme(text=element_text(family="Roboto"))

Avatar
Katherine Hoffman
Research Biostatistician I

I am passionate about meaningful, reproducible medical research.

Related