Omitted variable bias, FWL

This fact about covariance estimation will be useful for the following discussion:

$$
\begin{align*}
\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) & =\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}-\sum_{i}\left(x_{i}-\bar{x}\right)\bar{y}\\
 & =\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}-\bar{y}\sum_{i}\left(x_{i}-\bar{x}\right)\\
 & =\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}
\end{align*}
$$

and by the same argument $\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)=\sum_{i}\left(y_{i}-\bar{y}\right)x_{i}$ and $\sum_{i}\left(x_{i}-\bar{x}\right)\left(x_{i}-\bar{x}\right)=\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}$.
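A quick numerical check of these identities (a minimal sketch with arbitrary simulated vectors; the names x and y are just placeholders):

# demeaning both factors, or only one, gives the same sum
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)

sum((x - mean(x)) * (y - mean(y)))
sum((x - mean(x)) * y)

# same for the variance-style sum
sum((x - mean(x))^2)
sum((x - mean(x)) * x)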


OLR

Per our population model

$$ y = \beta_0 + \beta_1 x + u $$

and so our sample regression is

$$ y_i = \beta_0 + \beta_1 x_i + u_i \qquad(1) $$

where the index $i$ identifies each sample, our prediction is $\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i$, and the residuals are $u_i=y_i-\hat{y_i}$.

We know that the OLS formula for the regression coefficient $\beta_1$ is

$$
\begin{align*}
\hat{\beta_{1}} & =\frac{\text{Cov}\left(x_{i},y_{i}\right)}{\text{Var}\left(x_{i}\right)}\\
 & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)\left(x_{i}-\bar{x}\right)}\\
 & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}
\end{align*}
$$
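As a sanity check of this formula, here is a minimal sketch on simulated data (the intercept 2 and slope 3 are arbitrary choices, not from the lab data):

# the covariance-over-variance formula reproduces the lm() slope
set.seed(1)
x <- rnorm(200)
u <- rnorm(200)
y <- 2 + 3 * x + u              # true beta0 = 2, beta1 = 3

cov(x, y) / var(x)              # covariance-over-variance estimate
coef(lm(y ~ x))[["x"]]          # identical slope from lm()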

but per our model Equation 1 we can write

$$
\begin{align*}
\hat{\beta_{1}} & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(\beta_{0}+\beta_{1}x_{i}+u_{i}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}\\
 & =\frac{\beta_{0}\sum_{i}\left(x_{i}-\bar{x}\right)+\beta_{1}\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}+\sum_{i}\left(x_{i}-\bar{x}\right)u_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}\\
 & =\beta_{1}+\frac{\sum_{i}\left(x_{i}-\bar{x}\right)u_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}
\end{align*}
$$

and so our estimate $\hat{\beta_1}$ is equal to the population parameter $\beta_1$ if the noise (i.e. the error term $u$) is uncorrelated with our predictor.

The assumption that the error term is uncorrelated with our predictor is one of the assumptions we made in setting up ordinary linear regression.
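Continuing the sketch above, where the noise u is known because we simulated it (with true slope 3), the decomposition of $\hat{\beta_1}$ can be verified directly:

# true beta1 plus the finite-sample noise term reproduces the OLS slope
3 + sum((x - mean(x)) * u) / sum((x - mean(x)) * x)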

Omitted variable bias (OVB)

But what if the errors are not uncorrelated with the predictor? For example, what if our population model was really

$$ y = \beta_0 + \beta_1 x + \beta_2 z + u $$

with unobserved variable $z$, but our estimates are still $\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i$, then

$$
\begin{align*}
\hat{\beta_{1}} & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(\beta_{0}+\beta_{1}x_{i}+\beta_{2}z_{i}+u_{i}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}\\
 & =\frac{\beta_{0}\sum_{i}\left(x_{i}-\bar{x}\right)+\beta_{1}\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}+\beta_{2}\sum_{i}\left(x_{i}-\bar{x}\right)z_{i}+\sum_{i}\left(x_{i}-\bar{x}\right)u_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}\\
\mathbb{E}[\hat{\beta_{1}}|x] & =\beta_{1}+\beta_{2}\frac{\sum_{i}\left(x_{i}-\bar{x}\right)z_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}
\end{align*}
$$

where the $u$ term drops out in expectation because $u$ is uncorrelated with $x$, and the bias in our estimate of $\beta_1$ is $\beta_{2}\frac{\sum_{i}\left(x_{i}-\bar{x}\right)z_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}$.

OVB in action:

In this example we will simulate what happens when a predictor is correlated with an omitted (unobserved) variable.

We have seen the data below in lab-4, where

  • $y = \text{demand1}$
  • $x = \text{price1},\;\beta_1=-0.5$
  • $z = \text{unobserved1},\;\beta_2=-1$
> set.seed(1966)
> 
> dat1 <- tibble::tibble(
+   unobserved1 = rnorm(500)
+   , price1 = 10 + unobserved1 + rnorm(500)
+   , demand1 = 23 -(0.5*price1 + unobserved1 + rnorm(500))
+ )

Without including the unobserved variable, the fit is

> fit1 <- lm(demand1 ~ price1, data = dat1)
> broom::tidy(fit1)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   27.8      0.387       71.8 5.65e-265
2 price1        -0.989    0.0384     -25.8 9.44e- 94

and this fit estimates $\hat{\beta_1}=-0.99 < -0.5$, so it is biased. Checking using the covariance formula:

> cov(dat1$demand1, dat1$price1)/var(dat1$price1)
[1] -0.9893123

But we know we have an unobserved variable (and we know $\beta_2=-1$) so we can correct the bias.

> # biased estimate
> biased_est <- broom::tidy(fit1) %>% 
+   dplyr::filter(term == 'price1') %>% 
+   dplyr::pull(estimate)
> 
> # bias
> bias <- -cov(dat1$unobserved1, dat1$price1)/var(dat1$price1)
> 
> #corrected estimate
> biased_est - bias
[1] -0.4841394
> lm(demand1 ~ price1 + unobserved1, data = dat1) %>% 
+   broom::tidy()
# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   22.4      0.443      50.5  1.10e-197
2 price1        -0.443    0.0444     -9.98 1.74e- 21
3 unobserved1   -1.08     0.0638    -17.0  3.21e- 51

Frisch–Waugh–Lovell theorem

FWL or decomposition theorem:

When estimating a model of the form

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u $$

then the following estimators of $\beta_1$ are equivalent:

  • the OLS estimator obtained by regressing $y$ on $x_1$ and $x_2$
  • the OLS estimator obtained by regressing $y$ on $\check{x}_1$
    • where $\check{x}_1$ is the residual from the regression of $x_1$ on $x_2$
  • the OLS estimator obtained by regressing $\check{y}$ on $\check{x}_1$
    • where $\check{y}$ is the residual from the regression of $y$ on $x_2$

Interpretation:

The Frisch–Waugh–Lovell theorem tells us that there are multiple ways to estimate a single regression coefficient. One possibility is to run the full regression of $y$ on $x_1$ and $x_2$, as usual.

However, we can also regress $x_1$ on $x_2$, take the residuals, and regress $y$ on those residuals only. The first part of this process is sometimes referred to as partialling-out (or orthogonalization, or residualization) of $x_1$ with respect to $x_2$. The idea is that we are isolating the variation in $x_1$ that is independent of (orthogonal to) $x_2$. Note that $x_2$ can also be multi-dimensional (i.e. include multiple variables and not just one).
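As a quick check of the orthogonality claim, here is a sketch using the `dat1` tibble from the OVB example (the name `fit_partial` is just illustrative):

# partial out unobserved1 from price1; the residuals are uncorrelated with
# unobserved1 (up to floating-point error)
fit_partial <- lm(price1 ~ unobserved1, data = dat1)
cor(residuals(fit_partial), dat1$unobserved1)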

Why would one ever do that?

This seems like a much more complicated procedure. Instead of simply doing the regression in one step, we now need to do two or even three steps. It's not intuitive at all. The main advantage comes from the fact that we have reduced a multivariate regression to a univariate one, making it more tractable and more intuitive.

Example 1

Using the data from the OVB example:

> # partial out unobserved1 (predictor) from price1 (predictor)
> fit_price <- lm(price1 ~ unobserved1, data = dat1)
> 
> # regress demand1 (outcome) on price residuals
> lm(
+   demand1 ~ price_resid
+   , data = tibble::tibble(demand1 = dat1$demand1, price_resid = fit_price$residuals)
+ ) %>% 
+   broom::tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic     p.value
  <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   17.9      0.0830    216.   0          
2 price_resid   -0.443    0.0828     -5.35 0.000000135
> # partial out unobserved1 (predictor) from demand1 (outcome)
> fit_demand <- lm(demand1 ~ unobserved1, data = dat1)
> 
> # regress demand residuals on price residuals
> lm(
+   demand_resid ~ price_resid
+   , data = tibble::tibble(demand_resid = fit_demand$residuals, price_resid = fit_price$residuals)
+ ) %>% 
+   broom::tidy()
# A tibble: 2 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  1.37e-16    0.0445  3.08e-15 1.00e+ 0
2 price_resid -4.43e- 1    0.0444 -9.99e+ 0 1.59e-21

Partial Identification

In our linear regression models we include errors (aka noise) in the model of the outcome

$$ y_i = \beta_0 + \beta_1 x_i + u_i $$

where each noise term $u_i$ is drawn from a Gaussian distribution with zero mean and variance $\sigma^2$.

What if we are also uncertain about $x_i$? In particular, what if instead of $x_i$ we can only use $\tilde{x}_i=x_i + w_i$ where

$$ \text{Cov}\left(w,x\right)=\text{Cov}\left(w,u\right)=\mathbb{E}\left[w\right]=0 $$

i.e. $\tilde{x}_i$ reflects measurement error in the predictor $x_i$, where the measurement error has zero mean and is uncorrelated with $x$ and $u$ (e.g. the measurement error is an independent Gaussian).

In this case $\mathbb{E}\left[\tilde{x}\right]=\mathbb{E}\left[x+w\right]=\mathbb{E}\left[x\right]$ and

$$
\begin{align*}
\text{Cov}\left(\tilde{x},y\right) & =\text{Cov}\left(x+w,y\right)=\text{Cov}\left(x,y\right)+\text{Cov}\left(w,y\right)\\
 & =\text{Cov}\left(x,y\right)+\text{Cov}\left(w,\beta_{0}+\beta_{1}x+u\right)\\
 & =\text{Cov}\left(x,y\right)+\text{Cov}\left(w,u\right)+\beta_{1}\text{Cov}\left(w,x\right)\\
 & =\text{Cov}\left(x,y\right)
\end{align*}
$$

however $\text{Var}\left(\tilde{x}\right)=\text{Var}\left(x+w\right)=\text{Var}\left(x\right)+\text{Var}\left(w\right)\ge\text{Var}\left(x\right)$ (since $\text{Cov}\left(x,w\right)=0$), so

$$
\begin{align*}
\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)} & =\frac{\text{Cov}\left(x,y\right)}{\text{Var}\left(x\right)+\text{Var}\left(w\right)}\\
 & =\frac{\text{Cov}\left(x,y\right)/\text{Var}\left(x\right)}{1+\text{Var}\left(w\right)/\text{Var}\left(x\right)}\\
 & =\frac{\beta_{1}}{1+\text{Var}\left(w\right)/\text{Var}\left(x\right)}
\end{align*}
$$

and since $\text{Var}\left(w\right)/\text{Var}\left(x\right)$ is non-negative, $\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}$ has the same sign as $\beta_1$ and our data gives us a lower bound for $\left|\beta_1\right|$:

$$ \left|\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\right|\le\left|\beta_{1}\right| $$
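A short simulation illustrates this attenuation (a sketch; the sample size, the true $\beta_1 = 2$, and the unit variances are arbitrary choices, not from the lab data):

# attenuation: regressing y on a noisy measurement of x shrinks the slope
set.seed(1)
n       <- 10000
x       <- rnorm(n)                  # Var(x) = 1
w       <- rnorm(n)                  # measurement error, Var(w) = 1
u       <- rnorm(n)
y       <- 1 + 2 * x + u             # true beta1 = 2
x_tilde <- x + w                     # the noisy predictor we actually observe

cov(x_tilde, y) / var(x_tilde)       # roughly beta1 / (1 + Var(w)/Var(x)) = 1
coef(lm(y ~ x_tilde))[["x_tilde"]]   # same attenuated estimate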

If we reverse the regression (regress $\tilde{x}$ on $y$):

$$
\begin{align*}
\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(y\right)} & =\frac{\text{Cov}\left(x,y\right)}{\beta_{1}^{2}\text{Var}\left(x\right)+\text{Var}\left(u\right)}\\
 & =\frac{\beta_{1}\text{Var}\left(x\right)}{\beta_{1}^{2}\text{Var}\left(x\right)+\text{Var}\left(u\right)}
\end{align*}
$$

and taking the reciprocal:

$$ \frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}=\beta_{1}+\frac{\text{Var}\left(u\right)}{\beta_{1}\text{Var}\left(x\right)}=\beta_{1}\left[1+\frac{\text{Var}\left(u\right)}{\beta_{1}^{2}\text{Var}\left(x\right)}\right] $$

The factor in the brackets is greater than 1 and, as before, $\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}$ has the same sign as $\beta_1$ (it has the same sign as $\text{Cov}\left(\tilde{x},y\right)$), so

$$ \left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}\right|\ge\left|\beta_{1}\right| $$

Some terminology:

  • a bound is sharp if it cannot be improved (under our assumptions)
  • a bound is tight if it is short enough to be useful in practice

We have sharp bounds for $\beta_1$ under measurement error:

$$ \left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}\right|\ge\left|\beta_{1}\right|\ge\left|\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\right| \qquad(2) $$

With respect to how tight the bounds are, let $r$ denote the correlation between $\tilde{x}$ and $y$; then

$$
\begin{align*}
r^{2} & =\frac{\text{Cov}\left(\tilde{x},y\right)^{2}}{\text{Var}\left(\tilde{x}\right)\text{Var}\left(y\right)}=\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\times\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(y\right)}\\
r^{2}\times\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)} & =\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}
\end{align*}
$$

and the width of the bounds in Equation 2 is

$$ \text{width}=\left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}-\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\right|=\left(1-r^{2}\right)\left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}\right| $$

So the bounds for $\beta_1$ are tighter when $\tilde{x}$ and $y$ are strongly correlated.
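Continuing the measurement-error sketch above, both bounds and the width identity can be computed from the observed $(\tilde{x}, y)$ alone; the true $\beta_1 = 2$ from that simulation should lie between them:

# bounds from Equation 2, computed from observables only
lower <- abs(cov(x_tilde, y) / var(x_tilde))   # direct (attenuated) regression
upper <- abs(var(y) / cov(x_tilde, y))         # reciprocal of the reverse regression
c(lower = lower, upper = upper)                # true |beta1| = 2 lies in between

# width identity: (1 - r^2) * |Var(y) / Cov(x_tilde, y)|
r2 <- cor(x_tilde, y)^2
upper - lower
(1 - r2) * upper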
