This fact about covariance estimation will be useful for the following discussion:
$$
\begin{align*}
\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) & =\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}-\sum_{i}\left(x_{i}-\bar{x}\right)\bar{y}\\
 & =\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}-\bar{y}\sum_{i}\left(x_{i}-\bar{x}\right)\\
 & =\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}
\end{align*}
$$
and by the same argument $\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)=\sum_{i}\left(y_{i}-\bar{y}\right)x_{i}$ and $\sum_{i}\left(x_{i}-\bar{x}\right)\left(x_{i}-\bar{x}\right)=\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}$.
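As a quick numeric check, here is a minimal R sketch (the data and seed are illustrative, not from the notes) confirming the identity:

```r
# the demeaned cross-product is the same whether or not y is demeaned,
# because sum(x - mean(x)) = 0
set.seed(1)
x <- rnorm(10)
y <- rnorm(10)
sum((x - mean(x)) * (y - mean(y)))  # full demeaned product
sum((x - mean(x)) * y)              # identical value
```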
OLS
Per our population model
$$y = \beta_0 + \beta_1 x + u$$
and so our sample regression is
$$y_i = \beta_0 + \beta_1 x_i + u_i \qquad(1)$$
where the index $i$ identifies each sample, our prediction is $\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i$, and the residuals are $\hat{u_i}=y_i-\hat{y_i}$.
We know that the OLS formula for the regression coefficient $\beta_1$ is
$$
\begin{align*}
\hat{\beta_{1}} & =\frac{\text{Cov}\left(x_{i},y_{i}\right)}{\text{Var}\left(x_{i}\right)}\\
 & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)\left(x_{i}-\bar{x}\right)}\\
 & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)y_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}
\end{align*}
$$
but per our model (Equation 1) we can write
$$
\begin{align*}
\hat{\beta_{1}} & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(\beta_{0}+\beta_{1}x_{i}+u_{i}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}\\
 & =\frac{\beta_{0}\sum_{i}\left(x_{i}-\bar{x}\right)+\beta_{1}\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}+\sum_{i}\left(x_{i}-\bar{x}\right)u_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}\\
 & =\beta_{1}+\frac{\sum_{i}\left(x_{i}-\bar{x}\right)u_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}
\end{align*}
$$
and so our estimate $\hat{\beta_1}$ is equal to the population parameter $\beta_1$ if the noise term $u$ is uncorrelated with our predictor.
The assumption that the error term is uncorrelated with the predictor is one of the assumptions we made in setting up ordinary linear regression.
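As a hedged sketch (the coefficients, sample sizes, and seed below are illustrative assumptions), we can check both claims in R: the covariance formula reproduces `lm`'s slope, and when the noise is independent of the predictor the estimates are centered on the true $\beta_1$:

```r
# one draw: the covariance formula matches lm's slope estimate
set.seed(2)
x <- rnorm(200)
u <- rnorm(200)              # noise, independent of x by construction
y <- 1 + 0.7 * x + u
coef(lm(y ~ x))[["x"]]       # OLS slope from lm
cov(x, y) / var(x)           # same number from the covariance formula

# many draws: the average estimate is close to the true beta_1 = 0.7
est <- replicate(1000, {
  x <- rnorm(200)
  y <- 1 + 0.7 * x + rnorm(200)
  coef(lm(y ~ x))[["x"]]
})
mean(est)
```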
Omitted variable bias (OVB)
But what if the errors are not uncorrelated with the predictor? For example, what if our population model were really
$$y = \beta_0 + \beta_1 x + \beta_2 z + u$$
with unobserved variable $z$, but our estimate is still $\hat{y_i}=\hat{\beta_0}+\hat{\beta_1}x_i$, then
$$
\begin{align*}
\hat{\beta_{1}} & =\frac{\sum_{i}\left(x_{i}-\bar{x}\right)\left(\beta_{0}+\beta_{1}x_{i}+\beta_{2}z_{i}+u_{i}\right)}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}\\
 & =\frac{\beta_{0}\sum_{i}\left(x_{i}-\bar{x}\right)+\beta_{1}\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}+\beta_{2}\sum_{i}\left(x_{i}-\bar{x}\right)z_{i}+\sum_{i}\left(x_{i}-\bar{x}\right)u_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}
\end{align*}
$$
and, taking expectations conditional on the regressors (the $u$ term drops out since $\mathbb{E}\left[u\mid x,z\right]=0$),
$$
\mathbb{E}\left[\hat{\beta_{1}}\mid x,z\right]=\beta_{1}+\beta_{2}\frac{\sum_{i}\left(x_{i}-\bar{x}\right)z_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}
$$
and the bias in our estimate of $\beta_1$ is $\beta_{2}\frac{\sum_{i}\left(x_{i}-\bar{x}\right)z_{i}}{\sum_{i}\left(x_{i}-\bar{x}\right)x_{i}}$.
OVB in action:
In this example we will simulate what happens with correlated predictors.
We have seen the data below in lab-4, where
- $y = \text{demand1}$
- $x = \text{price1},\;\beta_1=-0.5$
- $z = \text{unobserved1},\;\beta_2=-1$
> set.seed(1966)
>
> dat1 <- tibble::tibble(
+   unobserved1 = rnorm(500)
+   , price1 = 10 + unobserved1 + rnorm(500)
+   , demand1 = 23 - (0.5 * price1 + unobserved1 + rnorm(500))
+ )
Without including the unobserved variable, the fit is
> fit1 <- lm(demand1 ~ price1, data = dat1)
> broom::tidy(fit1)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 27.8 0.387 71.8 5.65e-265
2 price1 -0.989 0.0384 -25.8 9.44e- 94
and this fit estimates $\hat{\beta_1}=-0.99$, well below the true $\beta_1=-0.5$, so it is biased. Checking using the covariance formula:
> cov(dat1$demand1, dat1$price1) / var(dat1$price1)
But we know we have an unobserved variable (and we know $\beta_2=-1$), so we can correct the bias.
> # biased estimate
> biased_est <- broom::tidy(fit1) %>%
+   dplyr::filter(term == 'price1') %>%
+   dplyr::pull(estimate)
>
> # bias = beta_2 * cov(z, x) / var(x), with beta_2 = -1
> bias <- -cov(dat1$unobserved1, dat1$price1) / var(dat1$price1)
>
> # corrected estimate
> biased_est - bias
Alternatively, fitting the full model with the unobserved variable included:
> lm(demand1 ~ price1 + unobserved1, data = dat1) %>%
+   broom::tidy()
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 22.4 0.443 50.5 1.10e-197
2 price1 -0.443 0.0444 -9.98 1.74e- 21
3 unobserved1 -1.08 0.0638 -17.0 3.21e- 51
Frisch–Waugh–Lovell theorem
FWL or decomposition theorem:
When estimating a model of the form
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$$
then the following estimators of $\beta_1$ are equivalent:
- the OLS estimator obtained by regressing $y$ on $x_1$ and $x_2$
- the OLS estimator obtained by regressing $y$ on $\check{x}_1$, where $\check{x}_1$ is the residual from the regression of $x_1$ on $x_2$
- the OLS estimator obtained by regressing $\check{y}$ on $\check{x}_1$, where $\check{y}$ is the residual from the regression of $y$ on $x_2$
Interpretation:
The Frisch–Waugh–Lovell theorem tells us that there are multiple ways to estimate a single regression coefficient. One possibility is to run the full regression of $y$ on $x_1$ and $x_2$, as usual.
However, we can also regress $x_1$ on $x_2$, take the residuals, and regress $y$ on those residuals alone. The first part of this process is sometimes referred to as partialling-out (or orthogonalization, or residualization) of $x_1$ with respect to $x_2$. The idea is that we are isolating the variation in $x_1$ that is independent of (orthogonal to) $x_2$. Note that $x_2$ can also be multi-dimensional (i.e. include multiple variables, not just one).
Why would one ever do that?
This seems like a much more complicated procedure: instead of simply running the regression in one step, we now need two or even three steps, and at first it is not intuitive at all. The main advantage comes from the fact that we have reduced a multivariate regression to a univariate one, making it more tractable and more intuitive.
Example 1
Using the data from the OVB example:
> # partial out unobserved1 (predictor) from price1 (predictor)
> fit_price <- lm(price1 ~ unobserved1, data = dat1)
>
> # regress demand1 (outcome) on the price residuals
> lm(
+   demand1 ~ price_resid
+   , data = tibble::tibble(demand1 = dat1$demand1, price_resid = fit_price$residuals)
+ ) %>%
+   broom::tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 17.9 0.0830 216. 0
2 price_resid -0.443 0.0828 -5.35 0.000000135
> # partial out unobserved1 (predictor) from demand1 (outcome)
> fit_demand <- lm(demand1 ~ unobserved1, data = dat1)
>
> # regress demand residuals on price residuals
> lm(
+   demand_resid ~ price_resid
+   , data = tibble::tibble(demand_resid = fit_demand$residuals, price_resid = fit_price$residuals)
+ ) %>%
+   broom::tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1.37e-16 0.0445 3.08e-15 1.00e+ 0
2 price_resid -4.43e- 1 0.0444 -9.99e+ 0 1.59e-21
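Note that in both cases the slope on price_resid is -0.443, matching the price1 coefficient from the full regression, as the theorem promises (only the standard error of the first, $y$-on-$\check{x}_1$ version differs). As a compact check (a sketch reusing `dat1`, `fit_price`, and `fit_demand` from above), all three estimators can be compared side by side:

```r
# the three FWL estimators of the price1 slope coincide
c(
  full       = coef(lm(demand1 ~ price1 + unobserved1, data = dat1))[["price1"]],
  partial_x  = coef(lm(dat1$demand1 ~ fit_price$residuals))[[2]],
  partial_xy = coef(lm(fit_demand$residuals ~ fit_price$residuals))[[2]]
)
```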
Partial Identification
In our linear regression models we include errors (aka noise) in the model of the outcome
$$y_i = \beta_0 + \beta_1 x_i + u_i$$
where each sample $u_i$ is drawn from a Gaussian distribution with zero mean and variance $\sigma^2$.
What if we are also uncertain about $x_i$? In particular, what if instead of $x_i$ we can only use $\tilde{x}_i=x_i+w_i$, where
$$\text{Cov}\left(w,x\right)=\text{Cov}\left(w,u\right)=\mathbb{E}\left[w\right]=0$$
i.e. $\tilde{x}_i$ reflects measurement error in the predictor $x_i$, where the measurement error has zero mean and is uncorrelated with $x$ and $u$ (e.g. the measurement error is an independent Gaussian).
In this case $\mathbb{E}\left[\tilde{x}\right]=\mathbb{E}\left[x+w\right]=\mathbb{E}\left[x\right]$ and
$$
\begin{align*}
\text{Cov}\left(\tilde{x},y\right) & =\text{Cov}\left(x+w,y\right)=\text{Cov}\left(x,y\right)+\text{Cov}\left(w,y\right)\\
 & =\text{Cov}\left(x,y\right)+\text{Cov}\left(w,\beta_{0}+\beta_{1}x+u\right)\\
 & =\text{Cov}\left(x,y\right)+\text{Cov}\left(w,u\right)+\beta_{1}\text{Cov}\left(w,x\right)\\
 & =\text{Cov}\left(x,y\right)
\end{align*}
$$
however $\text{Var}\left(\tilde{x}\right)=\text{Var}\left(x+w\right)=\text{Var}\left(x\right)+\text{Var}\left(w\right)\ge\text{Var}\left(x\right)$ (since $\text{Cov}\left(w,x\right)=0$), so
$$
\begin{align*}
\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)} & =\frac{\text{Cov}\left(x,y\right)}{\text{Var}\left(x\right)+\text{Var}\left(w\right)}\\
 & =\frac{\text{Cov}\left(x,y\right)/\text{Var}\left(x\right)}{1+\text{Var}\left(w\right)/\text{Var}\left(x\right)}\\
 & =\frac{\beta_{1}}{1+\text{Var}\left(w\right)/\text{Var}\left(x\right)}
\end{align*}
$$
and since $\text{Var}\left(w\right)/\text{Var}\left(x\right)$ is non-negative, $\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}$ has the same sign as $\beta_1$, and its magnitude gives us a lower bound for $\left|\beta_1\right|$:
$$\left|\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\right|\le\left|\beta_{1}\right|$$
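A minimal simulation sketch illustrates this attenuation (all parameters below are illustrative assumptions): with $\text{Var}(w)=\text{Var}(x)=1$, the slope on $\tilde{x}$ should be roughly $\beta_1/2$:

```r
# attenuation of the OLS slope under measurement error in x
set.seed(5)
n <- 10000
x <- rnorm(n)                      # true predictor, Var(x) ~ 1
w <- rnorm(n)                      # measurement error, Var(w) ~ 1
y <- 2 + 0.8 * x + rnorm(n)        # true beta_1 = 0.8
x_tilde <- x + w                   # what we actually observe
cov(x_tilde, y) / var(x_tilde)     # roughly 0.8 / (1 + 1) = 0.4
```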
If we reverse the regression (regress $\tilde{x}$ on $y$),
$$
\begin{align*}
\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(y\right)} & =\frac{\text{Cov}\left(x,y\right)}{\beta_{1}^{2}\text{Var}\left(x\right)+\text{Var}\left(u\right)}\\
 & =\frac{\beta_{1}\text{Var}\left(x\right)}{\beta_{1}^{2}\text{Var}\left(x\right)+\text{Var}\left(u\right)}
\end{align*}
$$
and taking the reciprocal:
$$
\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}=\beta_{1}+\frac{\text{Var}\left(u\right)}{\beta_{1}\text{Var}\left(x\right)}=\beta_{1}\left[1+\frac{\text{Var}\left(u\right)}{\beta_{1}^{2}\text{Var}\left(x\right)}\right]
$$
and the factor in the brackets is greater than 1, while, as before, $\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}$ has the same sign as $\beta_1$, so
$$\left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}\right|\ge\left|\beta_{1}\right|$$
Some terminology:
- a bound is sharp if it cannot be improved (under our assumptions)
- a bound is tight if it is short enough to be useful in practice
We have sharp bounds for $\beta_1$ under measurement error:
$$
\left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}\right|\ge\left|\beta_{1}\right|\ge\left|\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\right| \qquad(2)
$$
With respect to how tight the bounds are, let $r$ denote the correlation between $\tilde{x}$ and $y$; then
$$
\begin{align*}
r^{2} & =\frac{\text{Cov}\left(\tilde{x},y\right)^{2}}{\text{Var}\left(\tilde{x}\right)\text{Var}\left(y\right)}=\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\times\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(y\right)}\\
r^{2}\times\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)} & =\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}
\end{align*}
$$
and the width of the bounds in Equation 2 is
$$
\text{width}=\left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}-\frac{\text{Cov}\left(\tilde{x},y\right)}{\text{Var}\left(\tilde{x}\right)}\right|=\left(1-r^{2}\right)\left|\frac{\text{Var}\left(y\right)}{\text{Cov}\left(\tilde{x},y\right)}\right|
$$
So the bounds for β 1 \beta_1 are tighter when x ̃ \tilde{x} and y y are strongly correlated.
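Continuing the measurement-error sketch from above (same simulated `x_tilde` and `y`), we can verify that the two slopes bracket the true $\beta_1 = 0.8$ and that the width of the interval matches the $(1-r^2)$ formula:

```r
# the bounds of Equation 2 bracket beta_1 = 0.8
lower <- abs(cov(x_tilde, y) / var(x_tilde))  # attenuated slope
upper <- abs(var(y) / cov(x_tilde, y))        # reverse-regression reciprocal
c(lower = lower, beta1 = 0.8, upper = upper)

# the width of the bounds matches the (1 - r^2) formula
r2 <- cor(x_tilde, y)^2
c(direct = upper - lower, formula = (1 - r2) * upper)
```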