# Become a Superlearner! A Visual Guide & Introductory R Tutorial on Superlearning

Why use one machine learning algorithm when you could use all of them?! This post contains a step-by-step walkthrough of how to build a superlearner prediction algorithm in R.

# Supercuts of superlearning

• Superlearning is a technique for prediction that involves combining many individual statistical algorithms (commonly called “data-adaptive” or “machine learning” algorithms) to create a new, single prediction algorithm that is at least as good as any of the individual algorithms.

• The superlearner algorithm “decides” how to combine, or weight, the individual algorithms based upon how well each one minimizes a specified loss function, for example, the mean squared error (MSE). This is done using cross-validation to avoid overfitting.

• The motivation for this type of “ensembling” is that no single algorithm is always the best for all kinds of data. For example, sometimes a penalized regression may be best, but in other situations a random forest may be superior. Even extreme gradient boosting is not always ideal! Superlearning allows you to use the beneficial information each algorithm provides to ultimately optimize your prediction capabilities.

• Superlearning is also called stacking, stacked generalization, or weighted ensembling by different specializations within statistics and data science.

# Superlearning, step by step

First I’ll go through the algorithm one step at a time using a simulated data set.

## Step 0: Load libraries, set seed, simulate data

For simplicity I’ll show the concept of superlearning using only four variables (AKA features or predictors) to predict a continuous outcome. Let’s first simulate a continuous outcome, y, and four potential predictors, x1, x2, x3, and x4.

library(tidyverse)
library(kableExtra)
set.seed(7)
n <- 5000
obs <- tibble(
id = 1:n,
x1 = rnorm(n),
x2 = rbinom(n, 1, plogis(10*x1)),
x3 = rbinom(n, 1, plogis(x1*x2 + .5*x2)),
x4 = rnorm(n, mean=x1*x2, sd=.5*x3),
y = x1 + x2 + x2*x3 + sin(x4)
)
kable(head(obs), digits=3, caption = "Simulated data set")
Table 1: Simulated data set

| id | x1 | x2 | x3 | x4 | y |
|---:|-------:|---:|---:|-------:|-------:|
| 1 | 2.287 | 1 | 1 | 1.385 | 5.270 |
| 2 | -1.197 | 0 | 0 | 0.000 | -1.197 |
| 3 | -0.694 | 0 | 0 | 0.000 | -0.694 |
| 4 | -0.412 | 0 | 1 | -0.541 | -0.928 |
| 5 | -0.971 | 0 | 0 | 0.000 | -0.971 |
| 6 | -0.947 | 0 | 1 | -0.160 | -1.107 |

## Step 1: Split data into K folds

The superlearner algorithm relies on K-fold cross-validation (CV) to avoid overfitting. We will start this process by splitting the data into 10 folds. The easiest way to do this is by creating an index that assigns each observation to a CV fold.

k <- 10 # 10 fold cv
cv_index <- sample(rep(1:k, each = n/k)) # indices for each fold, same length as our dataset obs

## Step 2: Fit base learners for first CV-fold

Recall that in K-fold CV, each fold serves as the validation set exactly once. In this first round of CV, we will train all of our base learners on the first nine CV folds (cv_index = 1, 2, …, 9), holding out the last fold, cv_index == 10, as the validation set.

The individual algorithms or base learners that we’ll use here are three linear regressions with differently specified parameters:

1. Learner A: $$Y=\beta_0 + \beta_1 X_2 + \beta_2 X_4 + \epsilon$$

2. Learner B: $$Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_3 + \beta_4 sin(X_4) + \epsilon$$

3. Learner C: $$Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_2 + \beta_5 X_1 X_3 + \beta_6 X_2 X_3 + \beta_7 X_1 X_2 X_3 + \epsilon$$

cv_train_1 <- obs[-which(cv_index == 10),] # make a data set that contains all observations except those in k=10
fit_1a <- glm(y ~ x2 + x4, data=cv_train_1) # fit the first linear regression on that training data
fit_1b <- glm(y ~ x1 + x2 + x1*x3 + sin(x4), data=cv_train_1) # second LR fit on the training data
fit_1c <- glm(y ~ x1*x2*x3, data=cv_train_1) # and the third LR

I am only using linear regressions so that code for running more complicated algorithms does not take away from understanding the general superlearning algorithm.

Superlearning actually works best if you use a diverse set, or superlearner library, of base learners. For example, instead of three linear regressions, we could use the least absolute shrinkage and selection operator (LASSO), a random forest, and multivariate adaptive regression splines (MARS). Any parametric or non-parametric supervised machine learning algorithm can be included as a base learner.
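As a rough sketch of what a more flexible base learner could look like in this hand-coded workflow – assuming the glmnet package is installed, and reusing the cv_train_1 training data built in Step 2 above – a LASSO could slot in alongside the three linear regressions:

```r
# A sketch only: a LASSO base learner via glmnet (package assumed installed),
# fit to the same Step 2 training data (all folds except cv_index == 10)
library(glmnet)
x_train <- as.matrix(cv_train_1[, c("x1", "x2", "x3", "x4")])
fit_lasso <- cv.glmnet(x_train, cv_train_1$y, alpha = 1) # alpha = 1 is the LASSO penalty
# validation-fold predictions would then come from
# predict(fit_lasso, newx = ..., s = "lambda.min"), just like Step 3
```

Its cross-validated predictions would then feed into the metalearner in exactly the same way as those from Learners A, B, and C.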

## Step 3: Obtain predictions for first CV-fold

We can then use our validation data, cv_index == 10, to obtain our first set of cross-validated predictions.

cv_valid_1 <- obs[which(cv_index == 10),] # make a data set that only contains the observations in k=10
pred_1a <- predict(fit_1a, newdata = cv_valid_1) # use that data set as the validation for all the models in the SL library
pred_1b <- predict(fit_1b, newdata = cv_valid_1)
pred_1c <- predict(fit_1c, newdata = cv_valid_1)

Since we have 5000 observations split into 10 folds, that gives us three vectors of length 500: one set of predictions from each of Learners A, B, and C.

length(pred_1a) # double check we only have n/k predictions ...we do :-)
##  500
knitr::kable(head(cbind(pred_1a, pred_1b, pred_1c)), digits= 2, caption = "First CV round of predictions") 
Table 2: First CV round of predictions

| pred_1a | pred_1b | pred_1c |
|--------:|--------:|--------:|
| -1.39 | -0.77 | -0.40 |
| -1.27 | -0.34 | -0.11 |
| 2.16 | 1.32 | 1.10 |
| 4.27 | 4.26 | 3.98 |
| 3.31 | 3.98 | 3.78 |
| 2.29 | 2.42 | 2.83 |

## Step 4: Obtain CV predictions for entire data set

We'll want those predictions for every fold. So, using your favorite for loop, apply statement, or mapping function, fit the base learners and obtain predictions within each fold, so that each base learner ends up with 5000 cross-validated predictions – one for every observation in obs.

cv_folds <- as.list(1:k)
names(cv_folds) <- paste0("fold",1:k)
cv_preds <-
purrr::map_dfr(cv_folds, function(x){  # map_dfr loops through every fold (1:k) and binds the rows of the listed results together
cv_train <- obs[-which(cv_index == x),]  # make a training data set that contains all data except fold k
fit_a <- glm(y ~ x2 + x4, data=cv_train)  # fit all the base learners to that data
fit_b <- glm(y ~ x1 + x2 + x1*x3 + sin(x4), data=cv_train)
fit_c <- glm(y ~ x1*x2*x3, data=cv_train)
cv_valid <- obs[which(cv_index == x),]  # make a validation data set that only contains data from fold k
pred_a <- predict(fit_a, newdata = cv_valid)  # obtain predictions from all the base learners for that validation data
pred_b <- predict(fit_b, newdata = cv_valid)
pred_c <- predict(fit_c, newdata = cv_valid)
return(tibble("obs_id" = cv_valid$id, "cv_fold" = x, pred_a, pred_b, pred_c)) # save the predictions and the ids of the observations in a data frame
})
cv_preds %>%
arrange(obs_id) %>%
head() %>%
kable(digits=2, caption = "All CV predictions for all three base learners")

Table 3: All CV predictions for all three base learners

| obs_id | cv_fold | pred_a | pred_b | pred_c |
|-------:|--------:|-------:|-------:|-------:|
| 1 | 4 | 3.73 | 5.42 | 5.28 |
| 2 | 8 | -0.77 | -1.19 | -1.20 |
| 3 | 2 | -0.78 | -0.81 | -0.69 |
| 4 | 10 | -1.39 | -0.77 | -0.40 |
| 5 | 6 | -0.78 | -1.01 | -0.97 |
| 6 | 7 | -0.96 | -1.04 | -0.94 |

## Step 5: Choose and compute loss function of interest via metalearner

This is the key step of the superlearner algorithm: we will use a new learner, a metalearner, to take information from all of the base learners and create that new, single algorithm. Now that we have cross-validated predictions for every observation in the data set, we want to merge those CV predictions back into our main data set…

obs_preds <- full_join(obs, cv_preds, by=c("id" = "obs_id"))

…so that we can minimize a final loss function of interest between the true outcome and each CV prediction. This is how we're going to optimize our overall prediction algorithm: we want to make sure we're "losing the least" in the way we combine our base learners' predictions to ultimately make final predictions.

We can do this efficiently by choosing a new learner, a metalearner, which reflects the final loss function of interest. For simplicity, we'll use another linear regression as our metalearner. Using a linear regression as a metalearner minimizes the Mean Squared Error (MSE) when combining the base learner predictions. Note that we could use a variety of parametric or non-parametric regressions to minimize the MSE. No matter what metalearner we choose, the predictors will always be the cross-validated predictions from each base learner, and the outcome will always be the true outcome, y.
sl_fit <- glm(y ~ pred_a + pred_b + pred_c, data = obs_preds)
kable(broom::tidy(sl_fit), digits=3, caption = "Metalearner regression coefficients")

Table 4: Metalearner regression coefficients

| term | estimate | std.error | statistic | p.value |
|------|---------:|----------:|----------:|--------:|
| (Intercept) | -0.003 | 0.002 | -1.447 | 0.148 |
| pred_a | -0.017 | 0.004 | -4.739 | 0.000 |
| pred_b | 0.854 | 0.007 | 128.241 | 0.000 |
| pred_c | 0.165 | 0.005 | 30.103 | 0.000 |

This metalearner provides us with the coefficients, or weights, to apply to each of the base learners. In other words, if we have a set of predictions from Learners A, B, and C, we can obtain our best possible predictions by starting with an intercept of -0.003, then adding -0.017 $$\times$$ predictions from Learner A, 0.854 $$\times$$ predictions from Learner B, and 0.165 $$\times$$ predictions from Learner C. For more information on the metalearning step, check out the Appendix.

## Step 6: Fit base learners on entire data set

After we fit the metalearner, we officially have our superlearner algorithm, so it's time to input data and obtain predictions! To implement the algorithm and obtain final predictions, we first need to fit the base learners on the full data set.

fit_a <- glm(y ~ x2 + x4, data=obs)
fit_b <- glm(y ~ x1 + x2 + x1*x3 + sin(x4), data=obs)
fit_c <- glm(y ~ x1*x2*x3, data=obs)

## Step 7: Obtain predictions from each base learner on entire data set

We'll use those base learner fits to get predictions from each of the base learners for the entire data set, and then we will plug those predictions into the metalearner fit. Remember, we previously trained the metalearner on cross-validated predictions, rather than predictions from base learners fit on the whole data set. This was to avoid overfitting.
pred_a <- predict(fit_a)
pred_b <- predict(fit_b)
pred_c <- predict(fit_c)
full_data_preds <- tibble(pred_a, pred_b, pred_c)

## Step 8: Use metalearner fit to weight base learners

Once we have the predictions from the full data set, we can input them to the metalearner, where the output will be a final prediction for y.

sl_predictions <- predict(sl_fit, newdata = full_data_preds)
kable(head(sl_predictions), col.names = "sl_predictions", digits= 2, caption = "Final SL predictions (manual)")

Table 5: Final SL predictions (manual)

| sl_predictions |
|---------------:|
| 5.44 |
| -1.20 |
| -0.79 |
| -0.71 |
| -1.02 |
| -1.03 |

And… that's it! Those are our superlearner predictions.

## Step 9 and beyond…

We could compute the MSE of the ensemble superlearner predictions.

sl_mse <- mean((obs$y - sl_predictions)^2)
sl_mse
##  0.01927392

We could also add more algorithms to our base learner library (we definitely should, since we only used linear regressions!), and we could write functions to tune these algorithms' hyperparameters over various grids. For example, if we were to include random forest in our library, we may want to tune over the number of trees and the maximum bucket size.
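The SuperLearner package (used in the next section) has a helper for exactly this. As a sketch, assuming the SuperLearner and ranger packages are installed, a small tuning grid for random forests could be built with create.Learner():

```r
# A sketch, assuming the SuperLearner and ranger packages are installed:
# each hyperparameter combination becomes its own base learner
library(SuperLearner)
ranger_grid <- create.Learner(
  "SL.ranger",
  tune = list(num.trees = c(250, 500),    # number of trees to grow
              min.node.size = c(5, 20)),  # minimum terminal node ("bucket") size
  detailed_names = TRUE
)
ranger_grid$names # four tuned variants, ready to pass to SL.library
```

The superlearner then weights each tuned variant like any other base learner, so hyperparameter selection happens as part of the same cross-validated metalearning step.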

We can then cross-validate this entire process to evaluate the predictive performance of our superlearner algorithm. Alternatively, we could set aside a hold-out test set to evaluate the performance.
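The hold-out idea can be sketched in a few lines, reusing obs from Step 0 (the 80/20 split is an arbitrary choice for illustration):

```r
# A sketch of a hold-out split for evaluating the superlearner; the 80/20
# split is an arbitrary illustrative choice
holdout_ids <- sample(obs$id, size = 0.2 * nrow(obs))
train <- obs[!obs$id %in% holdout_ids, ] # build the entire superlearner (Steps 1-8) on this
test  <- obs[obs$id %in% holdout_ids, ]  # then compute the loss of its predictions here
```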

# Using the SuperLearner package

Or… we could use a package and avoid all the hand-coding. Here is how you would build an ensemble superlearner for our data with a base learner library of ranger (random forests), glmnet (LASSO, by default), and earth (MARS) using the SuperLearner package in R:

library(SuperLearner)
x_df <- obs %>% select(x1:x4) %>% as.data.frame()
sl_fit <- SuperLearner(Y = obs$y, X = x_df, family = gaussian(), SL.library = c("SL.ranger", "SL.glmnet", "SL.earth"))

You can specify the metalearner with the method argument. The default is Non-Negative Least Squares (NNLS).

## CV-Risk and Coefficient Weights

We can examine the cross-validated risk (the loss function) and the coefficient (weight) given to each of the models.

sl_fit
##
## Call:
## SuperLearner(Y = obs$y, X = x_df, family = gaussian(), SL.library = c("SL.ranger",
##     "SL.glmnet", "SL.earth"))
##
##
##                      Risk      Coef
## SL.ranger_All 0.013278476 0.1619231
## SL.glmnet_All 0.097149642 0.0000000
## SL.earth_All  0.003168299 0.8380769

From this summary we can see that the CV-risk (by default, the MSE) in this library of base learners is smallest for SL.earth. This translates to the largest coefficient, or weight, given to the predictions from earth.

The LASSO model implemented by glmnet has the largest CV-risk, and after the metalearning step, those predictions receive a coefficient, or weight, of 0. This means that the predictions from LASSO will not be incorporated into the final predictions at all.
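To illustrate the method argument mentioned above – a hedged example, not something run in the original analysis – we could swap in method.CC_LS, which estimates the weights as a convex combination (non-negative, summing to one) by least squares:

```r
# A sketch only: same base learner library as above, but with an alternative
# metalearner (method.CC_LS, convex-combination least squares) via `method`
library(SuperLearner)
sl_fit_cc <- SuperLearner(Y = obs$y, X = x_df, family = gaussian(),
                          SL.library = c("SL.ranger", "SL.glmnet", "SL.earth"),
                          method = "method.CC_LS")
sl_fit_cc$coef # estimated weight for each base learner
```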

## Obtaining the predictions

We can extract the predictions easily via the SL.predict element of the SuperLearner fit object.

kable(head(sl_fit$SL.predict), digits=2, col.names = "sl_predictions", caption = "Final SL predictions (package)")

Table 6: Final SL predictions (package)

| sl_predictions |
|---------------:|
| 5.29 |
| -1.19 |
| -0.68 |
| -0.87 |
| -0.97 |
| -1.08 |

## Cross-validated Superlearner

Recall that we can cross-validate the entire model fitting process to evaluate the predictive performance of our superlearner algorithm. This is easy with the function CV.SuperLearner(). Beware, this gets computationally burdensome very quickly!

cv_sl_fit <- CV.SuperLearner(Y = obs$y, X = x_df, family = gaussian(),
SL.library = c("SL.ranger", "SL.glmnet", "SL.earth"))

For more information on the SuperLearner package, take a look at this vignette.

## Alternative packages to superlearn

Other packages freely available in R that can be used to implement the superlearner algorithm include sl3 (an update to the older SuperLearner package), h2o, mlr3, and caretEnsemble. I previously wrote a brief demo on using sl3 for an NYC R-Ladies demo.

Python aficionados might find this blog post useful. I have never performed superlearning in Python, but if I had to, I would probably try h2o first. h2o is available in many programming languages, and their Chief Machine Learning Scientist, Erin LeDell, was an author of the original SuperLearner package.

# Coming soon… when prediction is not the end goal

Like most algorithms designed for prediction, the superlearner algorithm does not produce standard errors, making statistical inference such as confidence intervals and p-values impossible on its own. When prediction is not the end goal, superlearning pairs well with semi-parametric estimation methods (for example, Targeted Maximum Likelihood Estimation (TMLE)) for statistical inference. This allows the use of flexible models while placing minimal assumptions on the distribution of the data.

I made a similar visual guide for TMLE. If you found this superlearning tutorial helpful, check back here later for another one on TMLE. Alternatively, you can find me on Medium: @kathoffman317. If you’re curious about TMLE in the meantime, I really like this tutorial by Miguel Angel Luque Fernandez.

# Appendix

### Manually computing the MSE

Let’s say we have chosen our loss function of interest to be the Mean Squared Error (MSE). We could first compute the squared error $$(y - \hat{y})^2$$ for each CV prediction A, B, and C.

cv_sq_error <-
obs_preds %>%
mutate(cv_sqrd_error_a = (y-pred_a)^2,   # compute squared error for each observation
cv_sqrd_error_b = (y-pred_b)^2,
cv_sqrd_error_c = (y-pred_c)^2)
cv_sq_error %>%
pivot_longer(c(cv_sqrd_error_a, cv_sqrd_error_b, cv_sqrd_error_c), # make the CV squared errors long form for plotting
names_to = "base_learner",
values_to = "squared_error") %>%
mutate(base_learner = toupper(str_remove(base_learner, "cv_sqrd_error_"))) %>%
ggplot(aes(base_learner, squared_error, col=base_learner)) + # make box plots
geom_boxplot() +
theme_bw() +
guides(col=F) +
labs(x = "Base Learner", y="Squared Error", title="Squared Errors of Learner A, B, and C")

We can then take the mean of those three cross-validated squared error columns, grouped by cv_fold, to get the CV-MSE for each fold.

cv_risks <-
cv_sq_error %>%
group_by(cv_fold) %>%
summarise(cv_mse_a = mean(cv_sqrd_error_a),
cv_mse_b = mean(cv_sqrd_error_b),
cv_mse_c = mean(cv_sqrd_error_c)
)
## summarise() ungrouping output (override with .groups argument)
cv_risks %>%
pivot_longer(cv_mse_a:cv_mse_c,
names_to = "base_learner",
values_to = "mse") %>%
mutate(base_learner = toupper(str_remove(base_learner,"cv_mse_")))  %>%
ggplot(aes(cv_fold, mse, col=base_learner)) +
geom_point() +
theme_bw()  +
scale_x_continuous(breaks = 1:10) +
labs(x = "Cross-Validation (CV) Fold", y="Mean Squared Error (MSE)", col = "Base Learner", title="CV-MSEs for Base Learners A, B, and C")

We see that across each fold, Learner B consistently has an MSE around 0.02, while Learner C hovers around 0.1, and Learner A varies between 0.35 and 0.45. We can take another mean to get the overall CV-MSE for each learner.

cv_risks %>%
select(-cv_fold) %>%
summarise_all(mean) %>%
kable(digits=2, caption = "CV-MSE for each base learner") %>%
kable_styling(position = "center")
Table 7: CV-MSE for each base learner

| cv_mse_a | cv_mse_b | cv_mse_c |
|---------:|---------:|---------:|
| 0.38 | 0.02 | 0.11 |

The base learner that performs the best using our chosen loss function of interest is clearly Learner B. We can see from our data simulation code why this is true – Learner B almost exactly mimics the data generating mechanism of y.

Our results align with the linear regression fit from our metalearning step; Learner B predictions received a much larger coefficient relative to Learners A and C.

### Discrete Superlearner

We could stop after minimizing our loss function (MSE) and simply fit Learner B, the base learner with the lowest CV-MSE, to our full data set; this is called using the discrete superlearner.

discrete_sl_predictions <- predict(glm(y ~ x1 + x2 + x1*x3 + sin(x4), data=obs))

However, we can almost always create an even better prediction algorithm if we use information from all of the algorithms’ CV predictions.
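A quick way to check that claim here, reusing discrete_sl_predictions and the sl_mse computed in Step 9:

```r
# Compare the discrete superlearner's MSE to the ensemble superlearner's MSE
discrete_sl_mse <- mean((obs$y - discrete_sl_predictions)^2)
round(c(discrete = discrete_sl_mse, ensemble = sl_mse), 4)
```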

### Choosing a metalearner

The advice from Erin LeDell in her PhD dissertation, Scalable Ensemble Learning and Computationally Efficient Variance Estimation, on choosing a metalearner is:

“Since the set of predictions from the various base learners may be highly correlated, it is advisable to choose a metalearning method that performs well in the presence of collinear predictors. Regularization via Ridge or Lasso regression is commonly used to overcome the issue of collinearity among the predictor variables that make up the level-one dataset.”

Here, the level-one dataset refers to the cross-validated predictions from Step 4. She goes on to discuss how several statisticians (Leo Breiman, Michael LeBlanc, Robert Tibshirani) concluded empirically that the lowest prediction errors were obtained using NNLS.
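Following that advice, a ridge metalearner could be sketched with glmnet (package assumed installed), fit to the level-one data set obs_preds from Step 5:

```r
# A sketch only: a ridge-regularized metalearner on the level-one data set
# (the CV predictions from Step 4, merged into obs_preds in Step 5)
library(glmnet)
z <- as.matrix(obs_preds[, c("pred_a", "pred_b", "pred_c")])
meta_ridge <- cv.glmnet(z, obs_preds$y, alpha = 0) # alpha = 0 is the ridge penalty
coef(meta_ridge, s = "lambda.min") # regularized weights for the base learners
```

Because ridge shrinks correlated coefficients toward each other rather than zeroing them out, it is a natural fit when the base learners' predictions are highly collinear.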

Whatever metalearner we choose is simply a tool to minimize a loss function of interest. In these examples it is the MSE, but it can change with the goals of the prediction algorithm. For example, if the goal is to build a prediction algorithm that is best for binary classification, the loss function of interest may be the rank loss, or $$1-AUC$$. It is outside the scope of this post, but for more information, I recommend Ledell’s dissertation or her paper on maximizing the Area Under the Curve (AUC) with superlearner algorithms.

### Another visual guide

The steps of the superlearner algorithm are summarized nicely in a graphic in Chapter 3 of the Targeted Learning book.

# Session Info

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.6
##
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
##  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
##  stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##   SuperLearner_2.0-26 nnls_1.4            kableExtra_1.1.0
##   forcats_0.5.0       stringr_1.4.0       dplyr_1.0.2
##  tibble_3.0.3        ggplot2_3.3.2       tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
##   httr_1.4.1         jsonlite_1.6.1     viridisLite_0.3.0  foreach_1.5.0
##   modelr_0.1.6       Formula_1.2-3      assertthat_0.2.1   highr_0.8
##   cellranger_1.1.0   yaml_2.2.1         pillar_1.4.6       backports_1.1.8
##  lattice_0.20-38    glue_1.4.2         digest_0.6.25      rvest_0.3.5
##  colorspace_1.4-1   htmltools_0.4.0    Matrix_1.2-18      pkgconfig_2.0.3
##  broom_0.7.0        earth_5.1.2        haven_2.2.0        bookdown_0.19
##  scales_1.1.1       webshot_0.5.2      ranger_0.12.1      TeachingDemos_2.12
##  generics_0.0.2     farver_2.0.3       ellipsis_0.3.1     withr_2.2.0
##  cli_2.0.2          magrittr_1.5       crayon_1.3.4       readxl_1.3.1
##  evaluate_0.14      fs_1.4.1           fansi_0.4.1        xml2_1.3.0
##  cabinets_0.5.0     blogdown_0.19      tools_3.6.3        hms_0.5.3
##  lifecycle_0.2.0    munsell_0.5.0      reprex_0.3.0       glmnet_3.0-2
##  plotrix_3.7-7      compiler_3.6.3     rlang_0.4.7        plotmo_3.5.7
##  grid_3.6.3         iterators_1.0.12   rstudioapi_0.11    labeling_0.3
##  rmarkdown_2.1      gtable_0.3.0       codetools_0.2-16   DBI_1.1.0
##  R6_2.4.1           lubridate_1.7.9    knitr_1.28         shape_1.4.4
##  stringi_1.4.6      Rcpp_1.0.5         vctrs_0.3.4        dbplyr_1.4.3
##  tidyselect_1.1.0   xfun_0.14