# TL;DR

You can a regress an outcome on a grouping variable

*plus any other variable(s)*and the unadjusted and adjusted group means will be identical.We can see this in a simple example using the

`iris`

data:

```
iris %>%
# fit a linear regression for sepal length given sepal width and species
# make a new column containing the fitted values for sepal length
mutate(preds = predict(lm(Sepal.Length ~ Sepal.Width + Species, data = .))) %>%
# compute unadjusted and adjusted group means
group_by(Species) %>%
summarise(mean_SL = mean(Sepal.Length),
mean_SL_preds = mean(preds)) %>%
kable()
```

Species | mean_SL | mean_SL_preds |
---|---|---|

setosa | 5.006 | 5.006 |

versicolor | 5.936 | 5.936 |

virginica | 6.588 | 6.588 |

This is because \(E[E[Y|X,Z]|Z=z]=E[Y|Z=z]\).

We can view a fitted value from the regression, \(E[Y|X,Z]\), as a random variable to help us see this.

Skip to the end to see the proof.

I’ll admit I spent much of my first-semester of grad school struggling to understand the difference between \(X\) and \(x\). When I finally learned all the rules for expectations of random variables, I still had zero appreciation for their implications in my future work as an applied statistician.

I recently found myself in a rabbit hole of expectation properties while trying to write a seemingly simple function in `R`

. Now that I have the output of my function all sorted out, I have a newfound appreciation for how I can use regressions – a framework I’m very comfortable with – to rethink some of the properties I learned in my probability theory courses.

In the function, I was regressing an outcome on a few variables plus a grouping variable, and then returning the group means of the fitted values. My function kept outputting adjusted group means that were *identical* to the unadjusted group means.

I soon realized that for what I needed to do, my grouping variable should not be in the regression model. However, I was still perplexed as to how the adjusted and unadjusted group means could be the same.

I created a very basic example to test this unexpected result. I regressed a variable from the `iris`

data set, `Sepal.Length`

, on another variable called `Sepal.Width`

and a grouping variable `Species`

. I then looked at the mean within each category of `Species`

for both the unadjusted `Sepal.Length`

and fitted values from my linear regression model for `Sepal.Length`

.

```
library(dplyr)
library(knitr)
```

```
iris %>%
# fit a linear regression for sepal length given sepal width and species
# make a new column containing the fitted values for sepal length
mutate(preds = predict(lm(Sepal.Length ~ Sepal.Width + Species, data = .))) %>%
# compute unadjusted and adjusted group means
group_by(Species) %>%
summarise(mean_SL = mean(Sepal.Length),
mean_SL_preds = mean(preds)) %>%
kable()
```

Species | mean_SL | mean_SL_preds |
---|---|---|

setosa | 5.006 | 5.006 |

versicolor | 5.936 | 5.936 |

virginica | 6.588 | 6.588 |

I saw the same strange output, even in my simple example. I realized this must be some statistics property I’d learned about and since forgotten, so I decided to write out what I was doing in expectations.

First, I wrote down the unadjusted group means in the form of an expectation. I wrote down a conditional expectation, since we are looking at the mean of `Sepal.Length`

when `Species`

is restricted to a certain category. We can explicitly show this by taking the expectation of a random variable, \(\mathrm{Sepal Length}\), while setting another random variable, \(\mathrm{Species}\), equal to only one category at a time.

\(E[\mathrm{SepalLength}|\mathrm{Species}=setosa]\)

\(E[\mathrm{SepalLength}|\mathrm{Species}=virginica]\)

\(E[\mathrm{SepalLength}|\mathrm{Species}=versicolor]\)

More generally, we could write out the unadjusted group mean using a group indicator variable, \(\mathrm{Species}\), which can take on all possible values \(species\).

\(E[\mathrm{SepalLength}|\mathrm{Species}=species]\)

So that’s our unadjusted group means. What about the adjusted group mean? We can start by writing out the linear regression model, which is the expected value of \(\mathrm{SepalLength}\), conditional on the random variables \(\mathrm{SepalWidth}\) and \(\mathrm{Species}\).

\(E[\mathrm{SepalLength}|\mathrm{SepalWidth},\mathrm{Species}]\)

When I used the `predict`

function on the fit of that linear regression model, I obtained the fitted values from that expectation, before I separated the fitted values by group to get the grouped means. We can see those fitted values as random variables themselves, and write out another conditional mean using a group indicator variable, just as we did for the unadjusted group means earlier.

\[E[E[\mathrm{SepalLength}|\mathrm{SepalWidth},\mathrm{Species}]|\mathrm{Species}=species]\]

My table of unadjusted and adjusted Sepal Length means thus showed me that:

\[E[E[\mathrm{SepalLength}|\mathrm{SepalWidth},\mathrm{Species}]|\mathrm{Species}=species] \\ = E[\mathrm{SepalLength}|\mathrm{Species}=species]\]

Or, in more general notation:

\[E[E[Y|X,Z]|Z=z] = E[Y|Z=z]\]

Is it true?! Spoiler alert – yes. Let’s work through the steps of the proof one by one.

# Proof set-up

*Let’s pretend for the proof that both our \(Y\) (outcome), \(X\) (adjustment variable), and \(Z\) (grouping variable) are categorical (discrete) variables. This is just to make the math a bit cleaner, since the expectation of a discrete variable (a weighted summation) is a little easier to show than the expectation of a continuous variable (the integral of a probability density function times the realization of the random variable).*

*A few fundamental expectation results we’ll need:*

#### Conditional probability

\(P(A|B) = \frac{P(A ∩ B)}{P(B)}\)

#### Partition theorem

\(E[A|B] = \sum_Ba \cdot P(A=a|B=b)\)

#### Marginal distribution from a joint distribution

\(\sum_A\sum_Ba\cdot P(A=a,B=b) = \sum_Aa\sum_B\cdot P(A=a,B=b) = \sum_Aa\cdot P(A=a)=E[A]\)

# Step-by-step Proof

Click on the superscript number after each step for more information.

\(E[E[Y|X,Z]|Z=z]\)

\(=E[E[Y|X,Z=z]|Z=z]\) ^{1}

\(=\sum_{X}E[Y|X=x,Z=z]\cdot P(X=x|Z=z)\) ^{2}

\(=\sum_{X}\sum_{Y}y P(Y=y|X=x,Z=z)\cdot P(X=x|Z=z)\) ^{3}

\(=\sum_{X}\sum_{Y}y \frac{P(Y=y,X=x,Z=z)}{P(X=x,Z=z)}\cdot \frac{P(X=x,Z=z)}{P(Z=z)}\) ^{4}

\(=\sum_{X}\sum_{Y}y \frac{P(Y=y,X=x,Z=z)}{P(Z=z)}\) ^{5}

\(=\sum_{Y}y\sum_{X}\frac{P(Y=y,X=x,Z=z)}{P(Z=z)}\) ^{6}

\(=\sum_{Y}y\frac{P(Y=y,Z=z)}{P(Z=z)}\) ^{7}

\(=\sum_{Y}y P(Y=y|Z=z)\) ^{8}

\(=E[Y|Z=z]\) ^{9}

So, we’ve proved that:

\(E[E[Y|X,Z]|Z=z] = E[Y|Z=z]\)

which, thankfully, means I have an answer to my function output confusion. It was a lightbulb moment for me to realize I can think of an inner expectation as a random variable, and all the rules I learned about conditional and iterated expectations can be applied to the regressions I fit on a daily basis.

Here’s hoping you too feel inspired to revisit probability theory from time to time, even if your work is very applied. It is, after all, a perfect activity for social distancing! 😷

# References

Because we’re making our outer expectation conditional on \(Z=z\), we can also move \(Z=z\) into our inner expectation. This becomes obvious in the

`iris`

example, since we only use the fitted values from one category of`Species`

to get the adjusted group mean for that category.↩We can rewrite \(E[Y|X,Z=z]\) as the weighted summation of all possible values \(X\) can take. \(E[Y|X,Z=z]\) will only ever be able to take values of \(X\) that vary over the range of \(x\), \(E[Y|X=x,Z=z]\) since our value \(z\) is already fixed. We can weight each of these possible \(E[Y|X=x,Z=z]\) values by \(P(X=x|Z=z)\), since that’s the probabilty \(X\) will take value \(x\) at our already-fixed \(z\). Thus, we can start to find \(E[E[Y|X,Z=z]|Z=z]\) by weighting each \(E[Y|X=x,Z=z]\) by \(P(X=x|Z=z)\) and adding them all up (see Partition Theorem).↩

We can get the expectation of \(Y\) at each of those possible values of \(X\) by a similar process as step 2 (weighting each \(y\) by \(P(Y=y|X=x, Z=z)\).↩

By the Law of Conditional Probability, we can rewrite our conditional probabilities as joint distributions.↩

The denominator of the first fraction cancels out with the numerator of the second fraction.↩

We can switch the summations around so that \(y\) is outside the summation over all values of \(X\). This lets us get the joint distribution of only \(Y\) and \(Z\).↩

This is a conditional expectation, written in the form of a joint distribution.↩

By the Partition Theorem.↩

Rewriting the previous equation as an expectation.↩