Earlier this year I wrote my first blog post, “A Day in the Life of a Biostatistician,” documenting the granular details of my work as an early-career academic research biostatistician. I’m excited to announce that I’m turning that post into a “day in the life” series in which I interview other biostatisticians in differing roles. My hope is that the series will enlighten anyone interested in the field of biostatistics, and especially help undergraduates and current biostatistics master’s students make informed decisions about their careers.

Background In September 2019, I gave an R-Ladies NYC presentation about using the package sl3 to implement the superlearner algorithm for prediction. You can download the slides for it here. This post is a modified version of the original demo I gave.
For more background on what the superlearner algorithm is, please see my more recent blog post.
Step 0: Load your libraries, set a seed, and load the data You’ll likely need to install sl3 from the tlverse GitHub page, as it was not yet on CRAN at the time of writing this post.
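A minimal sketch of this step, assuming the usual `remotes::install_github()` route to the `tlverse/sl3` repository; the data file name here is a placeholder, not from the original post:

```r
# sl3 was not on CRAN at the time of writing; install from GitHub
# install.packages("remotes")
# remotes::install_github("tlverse/sl3")

library(sl3)

set.seed(7)  # make cross-validation splits reproducible

# placeholder: load whatever data set you plan to model
dat <- read.csv("your_data.csv")
```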

In early May I attended the New York R Conference. There were 24 speakers, including my coworker at Weill Cornell Medicine, Elizabeth Sweeney! Each person gave a 20-minute presentation on some way they use R for their work and/or hobbies. There was a ton of information, and even though not all of it was directly useful for my workflow as a statistical consultant in an academic setting, I really enjoyed being around so many people who love R.

It seems fitting that my first blog post is on a topic that I tried and failed to find via Google search a few years ago.
I’ll back up for a second. A few years ago I was a recent college graduate, and trying hard to “figure out my life.” My major was biochemistry, which is one of those degrees where 99%* of people just keep on going to school.

When doing long, identical analyses on different data sets or variables, it can be useful to have one function that outputs your analyses in an R Markdown-friendly (i.e., with headers) format. This is a simple example of how multiple mini-analyses can be combined into one run-all function containing headers. Let’s say we have two separate data sets, dat1 and dat2, and we want to do two analyses on each data set.
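As an illustration of the idea (the function name and the two mini-analyses are made up for this sketch), the run-all function can `cat()` markdown headers and print each result, inside a chunk set to `results = 'asis'`:

```r
# Sketch: run the same mini-analyses on any data set, printing markdown
# headers so knitr renders them (requires chunk option results = 'asis')
run_analyses <- function(dat, dat_name) {
  cat("\n# Analyses for", dat_name, "\n\n")

  cat("\n## Summary statistics\n\n")
  print(summary(dat))

  cat("\n## Correlation of first two columns\n\n")
  print(cor(dat[, 1], dat[, 2]))
}

# identical analysis pipeline, two data sets, two sets of headers
run_analyses(dat1, "dat1")
run_analyses(dat2, "dat2")
```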

A Presentation for Weill Cornell Medicine’s Biostatistics Computing Club
Image courtesy of Allison Horst’s Twitter: @allison_horst
Introduction
Why dplyr?
- Powerful but efficient
- Consistent syntax
- Fast
- Function chaining
- Works well with the entire tidyverse suite

Efficiency*
- Simple syntax
- Function chaining
- Ability to analyze external databases
- Works well with other packages in the tidyverse suite (ggplot2, tidyr, stringr, forcats, purrr)

*if you start dealing with data sets with > 1 million rows, data.
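A quick illustration of the consistent syntax and function chaining listed above, using the built-in mtcars data set:

```r
library(dplyr)

# chain verbs with the pipe: filter rows, derive a column, summarize by group
mtcars %>%
  filter(mpg > 20) %>%
  mutate(wt_lbs = wt * 1000) %>%      # wt is in 1000-lb units
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())
```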

A condensed key for my corresponding TMLE tutorial blog post.
Initial set-up
Estimand of interest:
\[ATE = \Psi = \mathrm{E}_W[\mathrm{E}[Y|A=1,\mathbf{W}] - \mathrm{E}[Y|A=0,\mathbf{W}]]\]
Step 1: Estimate the Outcome
First, estimate the expected value of the outcome using treatment and confounders as predictors.
\[Q(A,\mathbf{W}) = \mathrm{E}[Y|A,\mathbf{W}]\] Then use that fit to obtain estimates of the expected outcome under three different treatment conditions:
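A sketch of this step, substituting a simple `glm()` for whatever outcome model the tutorial uses; `dat`, `Y`, `A`, `W1`, and `W2` are placeholder names:

```r
# Step 1 sketch: fit E[Y | A, W], then predict under observed A, A = 1, A = 0
Q_fit <- glm(Y ~ A + W1 + W2, data = dat, family = binomial())

Q_A <- predict(Q_fit, type = "response")                                  # observed A
Q_1 <- predict(Q_fit, newdata = transform(dat, A = 1), type = "response") # set A = 1
Q_0 <- predict(Q_fit, newdata = transform(dat, A = 0), type = "response") # set A = 0
```

These three vectors of predictions are exactly the quantities that the plug-in estimate of the ATE formula above is built from.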
