Becoming a Biostatistician: A Follow-up to 'A Day in the Life of a Biostatistician'

It’s been over a year and a half since I began this blog with my “A Day in the Life of a Biostatistician” post. To my surprise, it is hands-down the post I receive the most emails about. I really enjoy hearing from everyone who reaches out, but I thought I’d give detailed answers to the most common questions I’ve received in this follow-up post. Caveat all my answers with “this is just one early-career biostatistician’s opinion,” please!

An Illustrated Guide to TMLE, Part III: Properties, Theory, and Learning More

This is the third and final post in a three-part series to help beginners and/or visual learners understand Targeted Maximum Likelihood Estimation (TMLE). In this section, I discuss more statistical properties of TMLE, offer a brief explanation for the theory behind TMLE, and provide resources for learning more. Properties of TMLE 📈 To reiterate a point from Parts I and II, a main motivation for TMLE is that it allows the use of machine learning algorithms while still yielding asymptotic properties for inference.

An Illustrated Guide to TMLE, Part II: The Algorithm

The second post of a three-part series to help beginners and/or visual learners understand Targeted Maximum Likelihood Estimation (TMLE). This section walks through the TMLE algorithm for the mean difference in outcomes for a binary treatment and binary outcome. This post is an expansion of a printable “visual guide” available on my GitHub. I hope it helps analysts who feel out-of-practice reading mathematical notation follow along with the TMLE algorithm.

An Illustrated Guide to TMLE, Part I: Introduction and Motivation

The introductory post of a three-part series to help beginners and/or visual learners understand Targeted Maximum Likelihood Estimation (TMLE). This section contains a brief overview of the targeted learning framework and motivation for semiparametric estimation methods for inference, including causal inference. Table of Contents This blog post series has three parts: Part I: Motivation TMLE in three sentences 🎯 An Analyst’s Motivation for Learning TMLE 👩🏼‍💻 Is TMLE Causal Inference?

Become a Superlearner! An Illustrated Guide to Superlearning

Why use one machine learning algorithm when you could use all of them?! This post contains a step-by-step walkthrough of how to build a superlearner prediction algorithm in R. A Visual Guide… Over the winter, I read Targeted Learning by Mark van der Laan and Sherri Rose. This “visual guide” I made for Chapter 3: Superlearning by Rose, van der Laan, and Eric Polley is a condensed version of the following tutorial.

On the Sidelines: NYC's COVID-19 Outbreak from the Eyes of a Pulmonary and Critical Care Team's Biostatistician

A personal narrative about my experience on the data analytics side of NYC’s COVID-19 outbreak response. December 15, 2018. My coworker is moving to California. She’s a statistician for a group of pulmonary and critical care physicians at our New York City hospital, and I’m a statistician who’s trying not to do too many things wrong, only three months into my first job out of school. “I think you’d be good with this research team,” she tells me.

Customizable correlation plots in R

TL;DR If you’ve ever felt limited by correlogram packages in R, this post will show you how to write your own function to tidy the many correlations into a ggplot2-friendly form for plotting. By the end, you will be able to run one function to get a tidied data frame of correlations: formatted_cors(mtcars) %>% head() %>% kable() measure1 measure2 r n p sig_p p_if_sig r_if_sig mpg mpg 1.

Rethinking Conditional and Iterated Expectations with Linear Regression Models

An “aha!” moment: the day I realized I should rethink all the probability theorems using linear regressions. TL;DR You can regress an outcome on a grouping variable plus any other variable(s) and the unadjusted and adjusted group means will be identical. We can see this in a simple example using the palmerpenguins data: #remotes::install_github("allisonhorst/palmerpenguins") library(palmerpenguins) library(tidyverse) library(gt) # use complete cases for simplicity penguins <- drop_na(penguins) penguins %>% # fit a linear regression for bill length given bill depth and species # make a new column containing the fitted values for bill length mutate(preds = predict(lm(bill_length_mm ~ bill_depth_mm + species, data = .

Lessons learned: my top five coding 'tricks' during the NYC COVID-19 outbreak

In non-coronavirus times, I am the biostatistician for a team of NYC pulmonologists and intensivists. When the pandemic hit NYC in mid-March, I immediately became a 100%, make that 200%, COVID-19 statistician. I received many analysis requests, though not all of them from official investigators: “My family recently learned I am the statistician for my hospital’s pulmonologists and now I get COVID-19 analysis requests from them, too” (Kat Hoffman, @rkatlady, April 10, 2020). Jokes aside, I was really, really busy during the outbreak.

Patient Treatment Timelines for Longitudinal Survival Data

I am a biostatistician at a research university, and I often find myself working with longitudinal survival data. As with any data analysis, I need to examine the quality of my data before deciding which statistical methods to implement. This post contains reproducible examples for how I prefer to visually explore survival data containing longitudinal exposures or covariates. I create a “treatment timeline” for each patient, and the end product looks something like this: