Why should you include your controls in the first stage of 2SLS?

So I was tutoring an undergraduate/master’s level applied econometrics course, and several students asked me why it is necessary to include the exogenous control variables in both stages of 2SLS – more specifically, wouldn’t this double-count the correlation between the instrument and the controls?

This turns out to be a fascinating question that I had somehow never thought carefully about, and one that also seems to lack documentation online (although I am known as a horrible search engine user…). Hence I spent half an afternoon putting together an illustration which, to keep it accessible to an audience with little matrix algebra background, is purely algebraic:

For simplicity, consider the following regression
$$\qquad\qquad Y_i=\beta_0+\beta_1X_i+\beta_2K_i+u_i\,, \qquad\qquad(1)$$

where $X_i$ and $u_i$ are correlated, and $K_i$ is exogenous. Because $X_i$ is endogenous, we will need an IV (or several) for it.
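If you prefer to see things run, here is a minimal simulation sketch of a data-generating process consistent with (1). All the concrete numbers (the $\beta$s, the strength of the endogeneity, the way $K_i$ is built to be correlated with the instrument) are made-up choices for illustration, not part of the derivation below; the later snippets reuse the same setup, assuming numpy is available.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                      # big sample, so sampling noise is negligible

# Made-up "true" parameters for equation (1)
beta0, beta1, beta2 = 1.0, 2.0, 0.5
theta = 0.8                      # anticipates the decomposition K_i = C_i + theta * Z_i used later

Z = rng.normal(size=n)           # the instrument
C = rng.normal(size=n)           # the part of K uncorrelated with Z
K = C + theta * Z                # exogenous control, deliberately correlated with Z
e = rng.normal(size=n)           # common shock shared by X and u: the endogeneity
X = 1.0 + 1.5 * Z + 0.7 * K + e + rng.normal(size=n)   # "true" pi_{1,1}=1.5, pi_{2,1}=0.7
u = e + rng.normal(size=n)
Y = beta0 + beta1 * X + beta2 * K + u                   # equation (1)
```

On these data, plain OLS of $Y$ on $X$ and $K$ would give a biased $\hat{\beta}_1$ because of the shared shock $e$, which is exactly why we reach for $Z$.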

A quick reminder that the second stage of 2SLS is running the following regression

$$Y_i=\beta_0+\beta_1\hat{X}_i+\beta_2K_i+\nu_i\,, $$

where $\hat{X}_i$ is the predicted value of $X_i$ from the first stage, whatever that might be.

It makes sense that $K_i$ needs to be included in the second stage, as that is just running (1) with $\hat{X}_i$ in place of $X_i$. For the first stage, let’s keep things simple and use only one instrument, $Z_i$. So the question becomes: why does

\begin{equation}
\qquad\qquad X_i=\pi_{0,1}+\pi_{1,1}Z_i+\pi_{2,1}K_i+\xi_{1,i} \qquad\qquad(2)
\end{equation}

make more sense than

\begin{equation}
\qquad\qquad X_i=\pi_{0,2}+\pi_{1,2}Z_i+\xi_{2,i}\,? \qquad\qquad(3)
\end{equation}

Well, it’s because of omitted variable bias! $Z$ and $K$ may well be correlated (the assumptions for $Z$ to be a valid IV do not rule that out!), and if we don’t account for that in the first stage, we can get ourselves into trouble.

For illustrative purposes, assume $K_i=C_i+\theta Z_i$ where $\mathrm{corr}(C_i,Z_i)=0$, and that we are fully aware of this (i.e., both $C_i$ and $\theta$ are known). Then (2) becomes

\begin{align*} X_i&=\pi_{0,1}+\pi_{1,1}Z_i+\pi_{2,1}(C_i+\theta Z_i)+\xi_{1,i} \\ &=\pi_{0,1}+(\pi_{1,1}+\pi_{2,1}\theta)Z_i+\pi_{2,1}C_i+\xi_{1,i}\,, \end{align*}

so if we regress $X$ on $Z$ and $K$, we are basically regressing $X$ on $Z$ and $C$, i.e., estimating $\pi_{2,1}$ and $\pi_{1,1}+\pi_{2,1}\theta$ and backing out $\pi_{1,1}$, since we know the value of $\theta$.¹ Both $\hat{\pi}_{1,1}$ and $\hat{\pi}_{2,1}$ will be unbiased in this situation.
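To see this numerically, here is a sketch on the simulated data from above (statsmodels assumed available, parameters still the made-up ones): regressing $X$ on $Z$ and $K$ recovers the first-stage coefficients directly, and regressing on $Z$ and $C$ then backing out with the known $\theta$ gives the same answer.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
theta = 0.8
Z = rng.normal(size=n)
C = rng.normal(size=n)
K = C + theta * Z
e = rng.normal(size=n)
X = 1.0 + 1.5 * Z + 0.7 * K + e + rng.normal(size=n)    # true pi_{1,1}=1.5, pi_{2,1}=0.7

# First stage (2): X on Z and K -> both coefficients estimated without bias
fs2 = sm.OLS(X, sm.add_constant(np.column_stack([Z, K]))).fit()
print(fs2.params[1], fs2.params[2])                      # roughly 1.5 and 0.7

# Equivalent route: X on Z and C, then back out pi_{1,1} using the known theta
fs_zc = sm.OLS(X, sm.add_constant(np.column_stack([Z, C]))).fit()
print(fs_zc.params[1] - fs_zc.params[2] * theta)         # roughly 1.5 again
```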

On the other hand, in (3) we have an omitted variable, $K_i$. We can actually pin down the size of the OVB because we know exactly what is omitted here! Equation (3) should be equivalent to a slight rearrangement of what we just did:

\[X_i=(\pi_{0,1}+\pi_{2,1}C_i)+(\pi_{1,1}+\pi_{2,1}\theta)Z_i+\xi_{1,i}\,,\]

which means

\begin{align*}\hat{\pi}_{0,2}&=\hat{\pi}_{0,1}+\hat{\pi}_{2,1}\bar{C}\,,\\ \hat{\pi}_{1,2}&=\hat{\pi}_{1,1}+\hat{\pi}_{2,1}\theta\,, \end{align*}

indicating that $\hat{\pi}_{1,2}$ comes with a bias of size $\hat{\pi}_{2,1}\theta$, which we cannot remove if we only estimate (3), since we have no idea about the size of $\hat{\pi}_{2,1}$ unless we also estimate (2).
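On the simulated data the omitted-variable bias shows up with exactly the predicted size. Here is a sketch (same made-up setup as above) where the coefficient from regressing $X$ on $Z$ alone lands near $\pi_{1,1}+\pi_{2,1}\theta$ rather than $\pi_{1,1}$.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
theta = 0.8
Z = rng.normal(size=n)
C = rng.normal(size=n)
K = C + theta * Z
e = rng.normal(size=n)
X = 1.0 + 1.5 * Z + 0.7 * K + e + rng.normal(size=n)    # true pi_{1,1}=1.5, pi_{2,1}=0.7

# First stage (3): omit K -> the coefficient on Z absorbs pi_{2,1} * theta
fs3 = sm.OLS(X, sm.add_constant(Z)).fit()
print(fs3.params[1])             # roughly 1.5 + 0.7 * 0.8 = 2.06, not 1.5
```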

So imagine plugging the two $\hat{X}$s into the second stage equation, one at a time:

\begin{align*}(2)\quad\Rightarrow\quad Y_i&=\beta_{0,1}+\beta_{1,1}\hat{X}_{i,1}+\beta_{2,1}K_i+u_i \\
&=\beta_{0,1}+\beta_{1,1}(\color{blue}{\hat{\pi}_{0,1}}\color{black}{+}(\color{blue}{\hat{\pi}_{1,1}}\color{black}{+}\color{blue}{\hat{\pi}_{2,1}\theta}\color{black}{)Z_i+}\color{blue}{\hat{\pi}_{2,1}}\color{black}{C_i)+\beta_{2,1}(C_i+}\color{blue}{\theta} \color{black}{Z_i)+u_i} \\
&=(\beta_{0,1}+\beta_{1,1}\color{blue}{\hat{\pi}_{0,1}}\color{black}{)+[\beta_{1,1}(}\color{blue}{\hat{\pi}_{1,1}}\color{black}{+}\color{blue}{\hat{\pi}_{2,1}\theta}\color{black}{)+\beta_{2,1}}\color{blue}{\theta}\color{black}{]Z_i+(\beta_{1,1}}\color{blue}{\hat{\pi}_{2,1}}\color{black}{+\beta_{2,1})C_i+u_i} \\
(3)\quad\Rightarrow\quad Y_i&=\beta_{0,2}+\beta_{1,2}\hat{X}_{i,2}+\beta_{2,2}K_i+u_i \\
&=\beta_{0,2}+\beta_{1,2}(\color{blue}{\hat{\pi}_{0,2}}\color{black}{+}\color{blue}{\hat{\pi}_{1,2}}\color{black}{Z_i)+\beta_{2,2}(C_i+}\color{blue}{\theta} \color{black}{Z_i)+u_i} \\
&=(\beta_{0,2}+\beta_{1,2}\color{blue}{\hat{\pi}_{0,2}}\color{black}{)+(\beta_{1,2}}\color{blue}{\hat{\pi}_{1,2}}\color{black}{+\beta_{2,2}}\color{blue}{\theta}\color{black}{)Z_i+\beta_{2,2}C_i+u_i}\end{align*}

I’m colouring everything we know in each scenario in blue. The reason I intentionally wrote $\beta_{j,1}$ and $\beta_{j,2}$ is that, as you’ll soon see, plugging in the two different $\hat{X}$s gives you different estimates!

To see why, suppose we run this regression instead:

$$Y_i=\gamma_0+\gamma_1Z_i+\gamma_2C_i+u_i\,.$$

Since $Z$ and $C$ are both exogenous, we can obtain unbiased $\hat{\gamma}\,$s:

\begin{array}{rclcl} \color{blue}{\hat{\gamma}_0}&\color{black}{=}&\hat{\beta}_{0,1}+\hat{\beta}_{1,1}\color{blue}{\hat{\pi}_{0,1}}&\color{black}{=}&\hat{\beta}_{0,2}+\hat{\beta}_{1,2}\color{blue}{\hat{\pi}_{0,2}} \\ \color{blue}{\hat{\gamma}_1}&\color{black}{=}&\hat{\beta}_{1,1}(\color{blue}{\hat{\pi}_{1,1}}\color{black}{+}\color{blue}{\hat{\pi}_{2,1}\theta}\color{black}{)+\hat{\beta}_{2,1}}\color{blue}{\theta}&\color{black}{=}&\hat{\beta}_{1,2}\color{blue}{\hat{\pi}_{1,2}}\color{black}{+\hat{\beta}_{2,2}}\color{blue}{\theta} \\
\color{blue}{\hat{\gamma}_2}&\color{black}{=}&\hat{\beta}_{1,1}\color{blue}{\hat{\pi}_{2,1}}\color{black}{+\hat{\beta}_{2,1}}&\color{black}{=}&\hat{\beta}_{2,2}
\end{array}

A little algebra then solves for the IV estimates corresponding to the two different first stages:

\begin{align*} (2)\quad\Rightarrow\quad \hat{\beta}_{1,1}&=\frac{\color{blue}{\hat{\gamma}_1}\color{black}{-}\color{blue}{\hat{\gamma}_2\theta}}{\color{blue}{\hat{\pi}_{1,1}}}\\
(3)\quad\Rightarrow\quad \hat{\beta}_{1,2}&=\frac{\color{blue}{\hat{\gamma}_1}\color{black}{-}\color{blue}{\hat{\gamma}_2\theta}}{\color{blue}{\hat{\pi}_{1,2}}}=\frac{\color{blue}{\hat{\gamma}_1}\color{black}{-}\color{blue}{\hat{\gamma}_2\theta}}{\hat{\pi}_{1,1}+\hat{\pi}_{2,1}\color{blue}{\theta}}
\end{align*}

The only difference between the two IV estimates is the denominator! $\color{blue}{\hat{\pi}_{1,1}}$ is unbiased, while $\color{blue}{\hat{\pi}_{1,2}}$ is biased. Hence if we run the first stage without including $K_i$, we end up with a biased IV estimate of $\beta_1$.
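To close the loop, here is a sketch of the full comparison on the simulated data, again with the made-up parameters and statsmodels assumed. The reduced-form $\hat{\gamma}$s combined with each denominator reproduce the two formulas above, and only the first stage that includes $K$ recovers the true $\beta_1$.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1, beta2, theta = 1.0, 2.0, 0.5, 0.8    # made-up true values; beta1 = 2 is the target
Z = rng.normal(size=n)
C = rng.normal(size=n)
K = C + theta * Z
e = rng.normal(size=n)
X = 1.0 + 1.5 * Z + 0.7 * K + e + rng.normal(size=n)
u = e + rng.normal(size=n)                          # endogeneity: X and u share the shock e
Y = beta0 + beta1 * X + beta2 * K + u

# Reduced form: Y on Z and C gives the (unbiased) gamma-hats
g = sm.OLS(Y, sm.add_constant(np.column_stack([Z, C]))).fit().params

# Denominators from the two candidate first stages
pi_11 = sm.OLS(X, sm.add_constant(np.column_stack([Z, K]))).fit().params[1]  # first stage (2)
pi_12 = sm.OLS(X, sm.add_constant(Z)).fit().params[1]                        # first stage (3)

print((g[1] - g[2] * theta) / pi_11)    # ~2.0: the true beta_1
print((g[1] - g[2] * theta) / pi_12)    # ~1.46 with these numbers: biased

# The same answers come out of literally running the two second stages by hand
Xhat1 = sm.OLS(X, sm.add_constant(np.column_stack([Z, K]))).fit().fittedvalues
Xhat2 = sm.OLS(X, sm.add_constant(Z)).fit().fittedvalues
print(sm.OLS(Y, sm.add_constant(np.column_stack([Xhat1, K]))).fit().params[1])  # ~2.0
print(sm.OLS(Y, sm.add_constant(np.column_stack([Xhat2, K]))).fit().params[1])  # ~1.46
```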

  1. In practice you can imagine that we regress $K$ on $Z$ and $C$ to get $\hat{\theta}$.

2 responses to “Why should you include your controls in the first stage of 2SLS?”

  1. Filippo

    Can’t you just check if K and Z are correlated to know if you should include K in the first stage as well??

    1. happy_barbarian

      There’s nothing stopping you from doing that! But 1) with real-life data they are unlikely to be completely uncorrelated; 2) that’s one extra step – including all control variables in the first stage is just a more foolproof, no-brainer way to do it, I guess!
