# Background

In September 2019, I gave an R-Ladies NYC presentation about using the package `sl3`

to implement the superlearner algorithm for predictions. You can download the slides for it here. This post is a modification to the original demo I gave.

For a better background on what the superlearner algorithm is, please see my more recent blog post.

# Step 0: Load your libraries, set a seed, and load the data

You’ll likely need to install `sl3`

from the `tlverse`

github page, as it was not yet on CRAN at the time of writing this post.

```
#devtools::install_github("tlverse/sl3")
library(sl3)
library(dplyr)
```

```
##
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
##
## filter, lag
```

```
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
```

`library(kableExtra)`

```
##
## Attaching package: 'kableExtra'
```

```
## The following object is masked from 'package:dplyr':
##
## group_rows
```

`set.seed(7)`

We will use the same WASH Benefits data set as the TLverse does in their Hitchhiker’s Guide. We will be predicting children in rural Kenya and Bengladesh’s weight to height z-scores.

`washb_data <- read.csv("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv")`

`kable(head(washb_data))`

whz | tr | fracode | month | aged | sex | momage | momedu | momheight | hfiacat | Nlt18 | Ncomp | watmin | elec | floor | walls | roof | asset_wardrobe | asset_table | asset_chair | asset_khat | asset_chouki | asset_tv | asset_refrig | asset_bike | asset_moto | asset_sewmach | asset_mobile |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0.00 | Control | N05265 | 9 | 268 | male | 30 | Primary (1-5y) | 146.40 | Food Secure | 3 | 11 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |

-1.16 | Control | N05265 | 9 | 286 | male | 25 | Primary (1-5y) | 148.75 | Moderately Food Insecure | 2 | 4 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

-1.05 | Control | N08002 | 9 | 264 | male | 25 | Primary (1-5y) | 152.15 | Food Secure | 1 | 10 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

-1.26 | Control | N08002 | 9 | 252 | female | 28 | Primary (1-5y) | 140.25 | Food Secure | 3 | 5 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |

-0.59 | Control | N06531 | 9 | 336 | female | 19 | Secondary (>5y) | 150.95 | Food Secure | 2 | 7 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

-0.51 | Control | N06531 | 9 | 304 | male | 20 | Secondary (>5y) | 154.20 | Severely Food Insecure | 0 | 3 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

# Step 1: Specify outcome and predictors

We need the columns for the outcome and predictors to be specified as strings.

```
outcome <- "whz"
covars <- washb_data %>%
select(-whz) %>%
names()
```

# Step 2: Make an sl3 task

This is the object we’ll use whenever we want to fit a statistical model in `sl3`

.

```
washb_task <- make_sl3_Task(
data = washb_data,
covariates = covars,
outcome = outcome
)
```

```
## Warning in process_data(data, nodes, column_names = column_names, flag = flag, :
## Missing covariate data detected: imputing covariates.
```

Note that we can’t have missing data in most statistical learning algorithms, so `sl3`

’s default pre-processing imputes at the median and adds a column for missingness (in case the missingness is informative).

`washb_task`

```
## A sl3 Task with 4695 obs and these nodes:
## $covariates
## [1] "tr" "fracode" "month" "aged"
## [5] "sex" "momage" "momedu" "momheight"
## [9] "hfiacat" "Nlt18" "Ncomp" "watmin"
## [13] "elec" "floor" "walls" "roof"
## [17] "asset_wardrobe" "asset_table" "asset_chair" "asset_khat"
## [21] "asset_chouki" "asset_tv" "asset_refrig" "asset_bike"
## [25] "asset_moto" "asset_sewmach" "asset_mobile" "delta_momage"
## [29] "delta_momheight"
##
## $outcome
## [1] "whz"
##
## $id
## NULL
##
## $weights
## NULL
##
## $offset
## NULL
##
## $time
## NULL
```

## An aside: Exploring `sl3`

’s many options

There’s a ton of different aspects of model fitting `sl3`

has the capabilities to address. For example, we can look into algorithms for when the outcome is binomial, categorical, or continuous. There are also options for when you have clustered data, or if you need to preprocess/screen your data before implementing base learners.

`sl3_list_properties()`

```
## [1] "binomial" "categorical" "continuous"
## [4] "cv" "density" "ids"
## [7] "multivariate_outcome" "offset" "preprocessing"
## [10] "sampling" "screener" "timeseries"
## [13] "weights" "wrapper"
```

We can learn more about each of these properties on this reference page.

## Another aside: looking at available “learners”

We’ll need to pick out base learners for our stack, as well as pick a metalearner. Since we are trying to predict z-scores, a continuous variable, let’s look at our potential learners for a continuous variable.

`sl3_list_learners("continuous") `

```
## [1] "Lrnr_arima" "Lrnr_bartMachine"
## [3] "Lrnr_bilstm" "Lrnr_bound"
## [5] "Lrnr_caret" "Lrnr_cv_selector"
## [7] "Lrnr_dbarts" "Lrnr_earth"
## [9] "Lrnr_expSmooth" "Lrnr_gam"
## [11] "Lrnr_gbm" "Lrnr_glm"
## [13] "Lrnr_glm_fast" "Lrnr_glmnet"
## [15] "Lrnr_grf" "Lrnr_gts"
## [17] "Lrnr_h2o_glm" "Lrnr_h2o_grid"
## [19] "Lrnr_hal9001" "Lrnr_HarmonicReg"
## [21] "Lrnr_hts" "Lrnr_lstm"
## [23] "Lrnr_mean" "Lrnr_multiple_ts"
## [25] "Lrnr_nnet" "Lrnr_nnls"
## [27] "Lrnr_optim" "Lrnr_pkg_SuperLearner"
## [29] "Lrnr_pkg_SuperLearner_method" "Lrnr_pkg_SuperLearner_screener"
## [31] "Lrnr_polspline" "Lrnr_randomForest"
## [33] "Lrnr_ranger" "Lrnr_rpart"
## [35] "Lrnr_rugarch" "Lrnr_screener_corP"
## [37] "Lrnr_screener_corRank" "Lrnr_screener_randomForest"
## [39] "Lrnr_solnp" "Lrnr_stratified"
## [41] "Lrnr_svm" "Lrnr_tsDyn"
## [43] "Lrnr_xgboost"
```

You’ll notice each learner starts with `Lrnr`

and seems to correspond to a package in `R`

.

# Step 3: Choosing the base learners

Let’s pick just a few base learners to match the examples in my slides: a random forest, a generalized boosting model, and a generalized linear model. Let’s keep their default parameters for now.

`make_learner_stack()`

is an easy way to create a stack of default baselearners. It takes the names of the learners as strings and you’re good to go!

```
stack <- make_learner_stack(
"Lrnr_randomForest",
"Lrnr_gbm",
"Lrnr_glm"
)
```

# Step 4: Choose a metalearner

There are many models we can choose from but we’ll keep it simple and use a generalized linear model. We are again using the `make_learner()`

function.

`metalearner <- make_learner(Lrnr_glm)`

# Step 5: Make a superlearner object

Remember, under-the-hood `Lrnr_sl`

takes the cross-validated predictions from the base models and uses them to predict the true outcome. That prediction model then is used to fit the predictions from base learners fit on the whole data set.

```
sl <- make_learner(Lrnr_sl,
learners = stack,
metalearner = metalearner)
```

A superlearner object has different functions built into it, such as `train()`

. We can train our superlearner shell model on the task we made earlier.

# Step 6: Train your superlearner

`sl_fit <- sl$train(washb_task)`

# Step 7: Examine the results of the superlearner

## Examine coefficients and CV-risk

The default risk is MSE (Mean Squared Error). The coefficients show you how the metalearner decided to weight each base model for the final ensemble.

`sl_fit$print() %>% kable() %>% kable_styling(c("striped","condensed","hover"))`

```
## [1] "SuperLearner:"
## List of 3
## $ : chr "Lrnr_randomForest_100_TRUE_5_NULL_FALSE"
## $ : chr "Lrnr_gbm_10000_2_0.001"
## $ : chr "Lrnr_glm_TRUE"
## [1] "Lrnr_glm_TRUE"
## $coefficients
## intercept Lrnr_randomForest_100_TRUE_5_NULL_FALSE
## -0.02935653 0.20828590
## Lrnr_gbm_10000_2_0.001 Lrnr_glm_TRUE
## 0.68702726 0.05494122
##
## $R
## intercept
## intercept -68.52007
## Lrnr_randomForest_100_TRUE_5_NULL_FALSE 0.00000
## Lrnr_gbm_10000_2_0.001 0.00000
## Lrnr_glm_TRUE 0.00000
## Lrnr_randomForest_100_TRUE_5_NULL_FALSE
## intercept 40.03889
## Lrnr_randomForest_100_TRUE_5_NULL_FALSE -22.33058
## Lrnr_gbm_10000_2_0.001 0.00000
## Lrnr_glm_TRUE 0.00000
## Lrnr_gbm_10000_2_0.001 Lrnr_glm_TRUE
## intercept 40.17235 40.178684
## Lrnr_randomForest_100_TRUE_5_NULL_FALSE -14.13014 -13.334809
## Lrnr_gbm_10000_2_0.001 10.62515 10.533005
## Lrnr_glm_TRUE 0.00000 -8.616894
##
## $rank
## [1] 4
##
## $family
##
## Family: gaussian
## Link function: identity
##
##
## $deviance
## [1] 4710.298
##
## $aic
## [1] 13349.11
##
## $null.deviance
## [1] 5000.347
##
## $iter
## [1] 2
##
## $df.residual
## [1] 4691
##
## $df.null
## [1] 4694
##
## $converged
## [1] TRUE
##
## $boundary
## [1] FALSE
##
## $linkinv_fun
## function (eta)
## eta
## <environment: namespace:stats>
##
## $link_fun
## function (mu)
## mu
## <environment: namespace:stats>
##
## $training_offset
## [1] FALSE
##
## [1] "Cross-validated risk (MSE, squared error loss):"
## learner coefficients risk se
## 1: Lrnr_randomForest_100_TRUE_5_NULL_FALSE 0.20828590 1.027691 0.02382883
## 2: Lrnr_gbm_10000_2_0.001 0.68702726 1.005110 0.02340409
## 3: Lrnr_glm_TRUE 0.05494122 1.019541 0.02381864
## 4: SuperLearner NA 1.003258 0.02337713
## fold_sd fold_min_risk fold_max_risk
## 1: 0.07185274 0.8720716 1.138160
## 2: 0.05893980 0.9058668 1.093545
## 3: 0.06093410 0.9113599 1.109910
## 4: 0.06151989 0.8917951 1.098658
```

learner | coefficients | risk | se | fold_sd | fold_min_risk | fold_max_risk |
---|---|---|---|---|---|---|

Lrnr_randomForest_100_TRUE_5_NULL_FALSE | 0.2082859 | 1.027691 | 0.0238288 | 0.0718527 | 0.8720716 | 1.138160 |

Lrnr_gbm_10000_2_0.001 | 0.6870273 | 1.005110 | 0.0234041 | 0.0589398 | 0.9058668 | 1.093545 |

Lrnr_glm_TRUE | 0.0549412 | 1.019541 | 0.0238186 | 0.0609341 | 0.9113599 | 1.109910 |

SuperLearner | NA | 1.003258 | 0.0233771 | 0.0615199 | 0.8917951 | 1.098658 |

## Look at the predictions

`predict()`

allows you to see what the model predicts on any given task. Here we look at predictions from the same data we trained the superlearner on, so the predicted weight to height z-scores of the first six children in our data set.

`sl_fit$predict(washb_task) %>% head()`

`## [1] -0.5846676 -0.7844686 -0.7167187 -0.7663325 -0.6388701 -0.6632535`

# Extras

Cross validate your entire ensembled superlearner using the cross-validation package

`origami`

, written by the same authors as`sl3`

. Or just hold out a testing data set to evaluate performance.Use

`make_learner()`

to customize the tuning parameters of your base learners or metalearner. Ex:`lrnr_RF_200trees <- make_lrnr(Lrnr_randomForest, ntree = 200)`

Add many layers to your superlearner and organize it into a “pipeline”

**For more demos, check out the following teaching materials from the authors of sl3.** My tutorial uses one of their example data sets in case you’d like to extend your learning via their training resources.