Causality Part 2

BSMM8740-2-R-2024F [WEEK - 8]

L.L. Odette

Recap of last week

  • Last week we introduced DAGs as a method to represent our causal assumptions and infer how we might estimate the causal effects of interest.
  • We worked through a method to estimate causal effects using IPW.

This week

We will look at other methods used to estimate causal effects, along with methods used for special situations, including:

  • regression adjustment
  • doubly robust estimation
  • matching
  • difference in differences
  • fixed effects and methods for panel data

Causal Effect Estimation

Inverse Probability Weighting

Last week we used IPW to create a pseudopopulation where, for every confounder level, the numbers of treated and untreated were balanced.

IPW requires us to build a model to predict the treatment, depending on the confounders (assuming we have data for all the confounders).

Inverse Probability Weighting

Repeating our prior process, we first calculate the IPW weights and then estimate the ATE using those weights.

ATE estimate by IPW
dat_ <- causalworkshop::net_data |> dplyr::mutate(net = as.numeric(net))

propensity_model <- glm(
  net ~ income + health + temperature
  , data = dat_, family = binomial()
)

net_data_wts <- propensity_model |>
  broom::augment(newdata = dat_, type.predict = "response") |>
  # .fitted is the value predicted by the model
  # for a given observation
  dplyr::mutate(wts = propensity::wt_ate(.fitted, .exposure = net))

net_data_wts |>
  lm(malaria_risk ~ net, data = _, weights = wts) |>
  broom::tidy(conf.int = TRUE) |> gt::gt() |> gtExtras::gt_theme_espn()
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 42.74 0.4420 96.69 0.000e+00 41.87 43.61
net -12.54 0.6243 -20.10 5.498e-81 -13.77 -11.32

Inverse Probability Weighting

The function propensity::wt_ate calculates unstabilized weights by default. In the last lecture we looked at the distribution of the weights and decided they were stable enough, because none of the weights were too big or too small.

Let’s explore the question of IPW stabilization a bit more.

Figure 1
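
A sketch of how such a figure can be produced from the weights computed above (assuming Figure 1 shows the distribution of the estimated ATE weights; this code is ours, not from the lecture):

plot the distribution of the unstabilized weights
net_data_wts |>
  ggplot(aes(wts)) +
  geom_histogram(bins = 50) +
  labs(x = "ATE weight (unstabilized)", y = "count")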

Inverse Probability Weighting

Stabilized weights

  • The IP weights (given covariate set \(L\)) are \(W^A=1/f(D|L)\): \(1/\mathbb{E}[D|L]\) for treated units and \(1/(1-\mathbb{E}[D|L])\) for untreated units. Because the estimated propensity can be very close to \(0\) or \(1\), some weights can be very large and the resulting estimates unstable.

  • Stabilized weights are often used to address this, e.g. the function propensity::wt_ate with stabilize = TRUE multiplies the weights by the mean of the treatment so in this case \(SW^A=\mathbb{E}[D]/\mathbb{E}[D|L]\).
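
A minimal sketch of the stabilized version, reusing propensity_model and dat_ from the earlier slide and the stabilize argument described above:

stabilized ATE weights
net_data_swts <- propensity_model |>
  broom::augment(newdata = dat_, type.predict = "response") |>
  # stabilize = TRUE multiplies the unstabilized weights by the mean of the treatment
  dplyr::mutate(swts = propensity::wt_ate(.fitted, .exposure = net, stabilize = TRUE))

# the weighted outcome model is unchanged
lm(malaria_risk ~ net, data = net_data_swts, weights = swts) |>
  broom::tidy(conf.int = TRUE)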

Inverse Probability Weighting

V-Stabilized weights

Baseline covariates (\(V\subset L\)) are also used to stabilize IP weights: \(SW^A(V)=\mathbb{E}[D|V]/\mathbb{E}[D|L]\).

  • Note that the variables \(V\) need to be included in the numerator and the denominator (as part of \(L\)).
  • V-stabilization results in IP weights that are more stabilized than the ones without V.
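
A hypothetical sketch of V-stabilized weights on the net data, treating income as the baseline covariate \(V\) and (income, health, temperature) as \(L\); the choice of \(V\) here is ours, purely for illustration:

V-stabilized weights (illustrative)
num_model <- glm(net ~ income, data = dat_, family = binomial())                          # E[D|V]
den_model <- glm(net ~ income + health + temperature, data = dat_, family = binomial())   # E[D|L]

v_stab_wts <- dat_ |>
  dplyr::mutate(
    p_num = predict(num_model, type = "response"),
    p_den = predict(den_model, type = "response"),
    # SW^A(V) = P(D = d | V) / P(D = d | L) for the observed treatment d
    swts_v = dplyr::if_else(net == 1, p_num / p_den, (1 - p_num) / (1 - p_den))
  )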

Inverse Probability Weighting

We know that IPW weighting should create a pseudo-population that makes the covariates more balanced across treatment groups. In the last lecture we checked this using histograms. Here we check it via summary statistics of the data:

Code
gtsummary::tbl_summary(
  net_data_wts 
  , by = net
  , include = c(net,income,health,temperature,malaria_risk)
) |>
  # add an overall column to the table
  gtsummary::add_overall(last = TRUE)

Characteristic 0 (N = 1,298) 1 (N = 454) Overall (N = 1,752)
income 880 (751, 1,010) 954 (806, 1,081) 893 (765, 1,031)
health 49 (35, 61) 54 (40, 68) 50 (37, 64)
temperature 23.9 (21.0, 27.1) 23.5 (20.6, 26.9) 23.8 (20.9, 27.1)
malaria_risk 41 (32, 54) 26 (20, 32) 36 (28, 50)
Values shown are median (Q1, Q3).

Code
net_svy <- survey::svydesign(
  ids = ~1,
  data = net_data_wts,
  weights = ~wts
)

gtsummary::tbl_svysummary(
  net_svy,
  by = net,
  include = c(net,income,health,temperature,malaria_risk)
) |>
  gtsummary::add_overall(last = TRUE)

Characteristic 0 (N = 1,751) 1 (N = 1,760) Overall (N = 3,511)
income 892 (767, 1,025) 892 (755, 1,039) 892 (761, 1,031)
health 50 (36, 63) 49 (37, 64) 50 (37, 64)
temperature 23.9 (20.9, 27.1) 23.8 (20.8, 27.3) 23.8 (20.8, 27.2)
malaria_risk 40 (31, 53) 28 (23, 34) 33 (26, 45)
Values shown are median (Q1, Q3), computed using the IPW weights.
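
Another common balance summary is the standardized mean difference (SMD) for each covariate, before and after weighting. A minimal manual sketch (the helper smd_fun is ours, not from the lecture packages):

standardized mean differences before and after weighting
smd_fun <- function(x, d, w = rep(1, length(x))) {
  m1 <- weighted.mean(x[d == 1], w[d == 1])
  m0 <- weighted.mean(x[d == 0], w[d == 0])
  # pool the unweighted group standard deviations for the denominator
  s <- sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
  (m1 - m0) / s
}

net_data_wts |>
  dplyr::summarise(
    income_unwtd      = smd_fun(income, net),
    income_wtd        = smd_fun(income, net, wts),
    health_unwtd      = smd_fun(health, net),
    health_wtd        = smd_fun(health, net, wts),
    temperature_unwtd = smd_fun(temperature, net),
    temperature_wtd   = smd_fun(temperature, net, wts)
  )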

Inverse Probability Weighting

Next we use bootstrapping to generate proper confidence intervals for the effect, first by creating a function to generate (unstabilized) ipw effect estimates for each bootstrap split:

fit_ipw <- function(split, ...) {
  # get bootstrapped data sample with `rsample::analysis()`
  if("rsplit" %in% class(split)){
    .df <- rsample::analysis(split)
  }else if("data.frame" %in% class(split)){
    .df <- split
  }

  # fit propensity score model
  propensity_model <- glm(
    net ~ income + health + temperature,
    data = .df,
    family = binomial()
  )

  # calculate inverse probability weights
  .df <- propensity_model |>
    broom::augment(type.predict = "response", data = .df) |>
    dplyr::mutate(wts = propensity::wt_ate(.fitted, net))

  # fit correctly bootstrapped ipw model
  lm(malaria_risk ~ net, data = .df, weights = wts) |>
    broom::tidy()
}

Inverse Probability Weighting

Next we use our function to bootstrap proper confidence intervals:

# create bootstrap samples
bootstrapped_net_data <- rsample::bootstraps(
  dat_,
  times = 1000,
  # required to calculate CIs later
  apparent = TRUE
)

# create ipw and fit each bootstrap sample
results <- bootstrapped_net_data |>
  dplyr::mutate(
    ipw_fits = purrr::map(splits, fit_ipw))
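
One way to turn these fits into confidence intervals is rsample's percentile-interval helper; a minimal sketch, assuming rsample::int_pctl is available (the following slides compute quantile-based intervals by hand instead):

percentile CIs from the bootstrap fits
results |>
  rsample::int_pctl(ipw_fits) |>
  dplyr::filter(term == "net")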

Regression Adjustment

IPW estimates rely on a model that predicts the treatment from covariate/confounder values. We can also estimate the effect of treatment by building a model for the outcome: a regression of the outcome on the treatment and the covariate/confounder values.

regression adjustment
outcome_model <- glm(
  malaria_risk ~ net + income + health + temperature + insecticide_resistance +
    I(health^2) + I(temperature^2) + I(income^2),
  data = dat_
)

outcome_model |> broom::tidy(conf.int = TRUE)
# A tibble: 9 × 7
  term  estimate std.error statistic   p.value conf.low
  <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>
1 (Int…  1.08e+2   3.55e+0    30.5   1.89e-164  1.01e+2
2 net   -1.24e+1   2.30e-1   -53.8   0         -1.28e+1
3 inco… -1.65e-1   4.58e-3   -36.1   3.48e-213 -1.74e-1
4 heal…  2.01e-1   2.82e-2     7.13  1.44e- 12  1.46e-1
5 temp…  7.69e-1   2.62e-1     2.93  3.42e-  3  2.55e-1
6 inse…  2.19e-1   6.89e-3    31.8   4.97e-175  2.05e-1
7 I(he… -6.60e-4   2.65e-4    -2.49  1.28e-  2 -1.18e-3
8 I(te…  5.01e-3   5.46e-3     0.919 3.58e-  1 -5.68e-3
9 I(in…  5.02e-5   2.50e-6    20.1   1.04e- 80  4.53e-5
# ℹ 1 more variable: conf.high <dbl>
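
The coefficient on net is one way to read off the effect. An equivalent route with this linear specification (and a useful habit when the outcome model includes treatment interactions) is g-computation: predict every observation's outcome under net = 1 and under net = 0 and average the difference. A minimal sketch, not part of the lecture code:

regression adjustment via g-computation
pred_treated   <- predict(outcome_model, newdata = dat_ |> dplyr::mutate(net = 1))
pred_untreated <- predict(outcome_model, newdata = dat_ |> dplyr::mutate(net = 0))

# average predicted difference; with no net interactions this equals the `net` coefficient
mean(pred_treated - pred_untreated)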

Regression Adjustment

And we can also bootstrap the regression adjustment estimates to get confidence intervals, first by creating the estimation function, then generating bootstrapped estimates, like we did with IPWs:

fit_reg <- function(split, ...) {
  # get bootstrapped data sample with `rsample::analysis()`
  if("rsplit" %in% class(split)){
    .df <- rsample::analysis(split)
  }else if("data.frame" %in% class(split)){
    .df <- split
  }

  # fit outcome model
  glm(malaria_risk ~ net + income + health + temperature + insecticide_resistance +
      I(health^2) + I(temperature^2) + I(income^2), data = .df
    )|>
    broom::tidy()
}

both_results <- results |>
  dplyr::mutate(
    reg_fits = purrr::map(splits, fit_reg))

IPW vs Regression Adjustment

We can compare the results:

both_results_dat <- both_results |>
  dplyr::mutate(
    reg_estimate = purrr::map_dbl(
      reg_fits,
      # pull the `estimate` for net for each fit
      \(.fit) .fit |>
        dplyr::filter(term == "net") |>
        dplyr::pull(estimate)
    )
    , ipw_estimate = purrr::map_dbl(
      ipw_fits,
      # pull the `estimate` for net for each fit
      \(.fit) .fit |>
        dplyr::filter(term == "net") |>
        dplyr::pull(estimate)
    )
  ) 

IPW vs Regression Adjustment

We can plot the results:

Code
dat <- both_results_dat |> 
  dplyr::select(reg_estimate, ipw_estimate) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
    )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

IPW vs Regression Adjustment

We can compare the means and the confidence intervals of the regression adjustment and IPW effect estimates:

compare IPW and regression effect estimates
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn()
method mean lower_ci upper_ci
ipw_estimate -12.56 -13.42 -11.66
reg_estimate -12.39 -12.98 -11.83

Note that each method's mean ATE estimate lies within the other method's confidence interval.

IPW vs Regression Adjustment

There is a package (boot) that performs the bootstrap resampling and CI estimation at the same time:

use the boot package for CI estimation of regression adjustment
# bootstrap (1000 times) using the fit_reg function
boot_out_reg <- boot::boot(
  data = causalworkshop::net_data |> dplyr::mutate(net = as.numeric(net))
  , R = 1000
  , sim = "ordinary"
  , statistic =
    (\(x,y){ # x is the data, y is a vector of row numbers for the bootstrap sample
      fit_reg(x[y,]) |>
        dplyr::filter(term == "net") |>
        dplyr::pull(estimate)
    })
)
# calculate the CIs
CIs <- boot_out_reg |>
  boot::boot.ci(L = boot::empinf(boot_out_reg, index=1L, type="jack"))

tibble::tibble(CI = c("lower", "upper"), normal = CIs$normal[-1], basic = CIs$basic[-(1:3)]
               , percent = CIs$percent[-(1:3)]) |> gt::gt() |> gtExtras::gt_theme_espn()
CI normal basic percent
lower -12.89 -12.88 -12.93
upper -11.83 -11.82 -11.87

Doubly Robust Estimation

Don’t Put All your Eggs in One Basket

  • We’ve learned how to use linear regression and propensity score weighting to estimate \(E[Y|D=1, X] - E[Y|D=0, X]\). But which one should we use, and when?

  • When in doubt, just use both! Doubly Robust Estimation is a way of combining the propensity score and linear regression so that you don’t have to rely on either one alone being correct.

Doubly Robust Estimation

Doubly Robust Estimation

The estimator is as follows.

\[ \hat{ATE} = \frac{1}{N}\sum \bigg( \dfrac{D_i(Y_i - \hat{\mu_1}(X_i))}{\hat{P}(X_i)} + \hat{\mu_1}(X_i) \bigg) - \frac{1}{N}\sum \bigg( \dfrac{(1-D_i)(Y_i - \hat{\mu_0}(X_i))}{1-\hat{P}(X_i)} + \hat{\mu_0}(X_i) \bigg) \]

where

  • \(\hat{P}(x)\) is an estimation of the propensity score (using logistic regression, for example),

  • \(\hat{\mu_1}(x)\) is an estimation of \(E[Y|X, D=1]\) (using linear regression, for example).

The first part of the doubly robust estimator estimates \(E[Y^1]\) and the second part estimates \(E[Y^0]\).

Doubly Robust Estimation

First let’s examine the code. Note that we use a recipe to add polynomial terms for each confounder before fitting the models.

D <- "net"
Y <- "malaria_risk"
X <- paste0(c('income', 'health', 'temperature'),c(rep('_poly_1',3),rep('_poly_2',3)))

doubly_robust_rec <- causalworkshop::net_data |> 
  dplyr::mutate(net = as.numeric(net)) |> 
  recipes::recipe(malaria_risk ~ net + income + health + temperature) |> 
  # NOTE: these are orthogonal polynomial terms; they were not used in the earlier regression adjustment
  recipes::step_poly(income, health, temperature) |> 
  recipes::prep() 

doubly_robust_dat <- doubly_robust_rec |> recipes::bake(new_data=NULL)
doubly_robust <- function(df, X, D, Y){
  ps <- # propensity score
    as.formula(paste(D, " ~ ", paste(X, collapse= "+"))) |>
    stats::glm( data = df, family = binomial() ) |>
    broom::augment(type.predict = "response", data = df) |>
    dplyr::pull(.fitted)
  
  lin_frml <- formula(paste(Y, " ~ ", paste(X, collapse= "+")))
  
  idx <- df[,D] |> dplyr::pull(1) == 0
  mu0 <- # mean response D == 0
    lm(lin_frml, data = df[idx,]) |> 
    broom::augment(type.predict = "response", newdata = df[,X]) |>
    dplyr::pull(.fitted)
  
  idx <- df[,D] |> dplyr::pull(1) == 1
  mu1 <- # mean response D == 1
    lm(lin_frml, data = df[idx,]) |>  
    broom::augment(type.predict = "response", newdata = df[,X]) |> 
    dplyr::pull(.fitted)
  
  # convert treatment factor to integer | recast as vectors
  d <- df[,D] |> dplyr::pull(1) |> as.character() |> as.numeric()
  y <- df[,Y] |> dplyr::pull(1)
  
  mean( d*(y - mu1)/ps + mu1 ) -
    mean(( 1-d)*(y - mu0)/(1-ps) + mu0 )
}
doubly_robust_dat |> 
  doubly_robust(X, D, Y)
[1] -12.9

Doubly Robust Estimation

Once again, we can use bootstrap to construct confidence intervals.

use the bootstrap folds to estimate the ATE using DRE
all_results_dat <- both_results_dat |>
  dplyr::mutate(
    dre_estimate = 
      purrr::map_dbl(
        splits
        , (\(x){
            doubly_robust_rec |> 
              recipes::bake(new_data = rsample::analysis(x)) |> 
              doubly_robust(X, D, Y)
        })
      )
  )

all_results_dat
# Bootstrap sampling with apparent sample 
# A tibble: 1,001 × 7
   splits             id            ipw_fits reg_fits
   <list>             <chr>         <list>   <list>  
 1 <split [1752/647]> Bootstrap0001 <tibble> <tibble>
 2 <split [1752/662]> Bootstrap0002 <tibble> <tibble>
 3 <split [1752/636]> Bootstrap0003 <tibble> <tibble>
 4 <split [1752/642]> Bootstrap0004 <tibble> <tibble>
 5 <split [1752/621]> Bootstrap0005 <tibble> <tibble>
 6 <split [1752/640]> Bootstrap0006 <tibble> <tibble>
 7 <split [1752/628]> Bootstrap0007 <tibble> <tibble>
 8 <split [1752/652]> Bootstrap0008 <tibble> <tibble>
 9 <split [1752/663]> Bootstrap0009 <tibble> <tibble>
10 <split [1752/649]> Bootstrap0010 <tibble> <tibble>
# ℹ 991 more rows
# ℹ 3 more variables: reg_estimate <dbl>,
#   ipw_estimate <dbl>, dre_estimate <dbl>

Doubly Robust Estimation

Looking at the histogram/distributions of all the estimates:

Code
dat <- all_results_dat |> 
  dplyr::select(reg_estimate, ipw_estimate, dre_estimate) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
    )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

Doubly Robust Estimation

Confidence intervals, including for the DRE estimate, are:

compute CIs for IPW, regression and DRE
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn()
method mean lower_ci upper_ci
dre_estimate -12.91 -13.62 -12.28
ipw_estimate -12.56 -13.42 -11.66
reg_estimate -12.39 -12.98 -11.83

All methods are consistent.

Doubly Robust Estimation

The doubly robust estimator is called doubly robust because it only requires one of the models, \(\hat{P}(x)\) or \(\hat{\mu}(x)\), to be correctly specified.

Look at the first part, which estimates \(E[Y^1]\):

\[ \hat{E}[Y^1] = \frac{1}{N}\sum \bigg( \dfrac{D_i(Y_i - \hat{\mu_1}(X_i))}{\hat{P}(X_i)} + \hat{\mu_1}(X_i) \bigg) \]

Assume that \(\hat{\mu_1}(x)\) is correct. Then, even if the propensity score model is wrong, we don’t need to worry: if \(\hat{\mu_1}(x)\) is correct, then \(E[D_i(Y_i - \hat{\mu_1}(X_i))]=0\), because the multiplication by \(D_i\) selects only the treated, and the residuals of \(\hat{\mu_1}\) on the treated have, by definition, mean zero.

This causes the whole expression to reduce to \(\hat{\mu_1}(X_i)\), which correctly estimates \(E[Y^1]\) by assumption. Similar reasoning applies to the estimator of \(E[Y^0]\).

Doubly Robust Estimation

Here we deliberately break the propensity score model by replacing the propensity score with a random uniform variable between 0.1 and 0.9 (we avoid values near 0 or 1 so that the weights don’t blow up). Since this is random, there is no way it is a good propensity score model, yet the doubly robust estimator still produces an estimate that is very close to the one obtained when the propensity score was estimated with logistic regression.

define a DRE with a bad IPW model
doubly_robust_bad_ipw <- function(df, X, D, Y){
  ps <- runif(dim(df)[1], 0.1, 0.9) # wrong propensity score 
  
  lin_frml <- formula(paste(Y, " ~ ", paste(X, collapse= "+")))
  
  idx <- df[,D] |> dplyr::pull(1) == 0
  mu0 <- # mean response D == 0
    lm(lin_frml, data = df[idx,]) |>
    broom::augment(type.predict = "response", newdata = df[,X]) |>
    dplyr::pull(.fitted)
  
  idx <- df[,D] |> dplyr::pull(1) == 1
  mu1 <- # mean response D == 1
    lm(lin_frml, data = df[idx,]) |>
    broom::augment(type.predict = "response", newdata = df[,X]) |>
    dplyr::pull(.fitted)
  
  # convert treatment factor to integer | recast as vectors
  d <- df[,D] |> dplyr::pull(1) |> as.character() |> as.numeric()
  y <- df[,Y] |> dplyr::pull(1)
  
  mean( d*(y - mu1)/ps + mu1 ) -
    mean(( 1-d)*(y - mu0)/(1-ps) + mu0 )
}

Doubly Robust Estimation

Code
all_results_dat <- all_results_dat |>
  dplyr::mutate(
    dre_bad_ipw_estimate = 
      purrr::map_dbl(
        splits
        , (\(x){
          # rsample::analysis(x) |> 
          #   # dplyr::mutate(net = as.numeric(net)) |> # not needed
          #   recipes::bake(new_data = rsample::analysis(x)) |> 
          doubly_robust_rec |> 
            recipes::bake(new_data = rsample::analysis(x)) |> 
            doubly_robust_bad_ipw(X, D, Y)
        })
      )
  )

dat <- all_results_dat |> 
  dplyr::select(reg_estimate, ipw_estimate, dre_estimate, dre_bad_ipw_estimate) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
    )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

Doubly Robust Estimation

summarize the models so far.
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn()
method mean lower_ci upper_ci
dre_bad_ipw_estimate -12.93 -13.71 -12.16
dre_estimate -12.91 -13.62 -12.28
ipw_estimate -12.56 -13.42 -11.66
reg_estimate -12.39 -12.98 -11.83

Doubly Robust Estimation

Messing up the propensity score yields slightly different ATEs, but not by much. Now let’s again take a good look at the first part of the estimator, rearranging some terms:

\[ \begin{align*} \hat{E}[Y^{1}] & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}(Y_{i}-\hat{\mu_{1}}(X_{i}))}{\hat{P}(X_{i})}+\hat{\mu_{1}}(X_{i})\bigg)\\ & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}Y_{i}}{\hat{P}(X_{i})}-\dfrac{D_{i}\hat{\mu_{1}}(X_{i})}{\hat{P}(X_{i})}+\hat{\mu_{1}}(X_{i})\bigg)\\ & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}Y_{i}}{\hat{P}(X_{i})}-\bigg(\dfrac{D_{i}}{\hat{P}(X_{i})}-1\bigg)\hat{\mu_{1}}(X_{i})\bigg)\\ & =\frac{1}{N}\sum\bigg(\dfrac{D_{i}Y_{i}}{\hat{P}(X_{i})}-\bigg(\dfrac{D_{i}-\hat{P}(X_{i})}{\hat{P}(X_{i})}\bigg)\hat{\mu_{1}}(X_{i})\bigg) \end{align*} \]

Doubly Robust Estimation

Assume that the propensity score \(\hat{P}(X_i)\) is correctly specified. In this case, \(E[D_i - \hat{P}(X_i)]=0\), which wipes out the part dependent on \(\hat{\mu_1}(X_i)\). This reduces the doubly robust estimator to the propensity score weighting estimator \(\frac{D_iY_i}{\hat{P}(X_i)}\), which is correct by assumption.

So, even if the \(\hat{\mu_1}(X_i)\) is wrong, the estimator will still be correct, provided that the propensity score is correctly specified.

Doubly Robust Estimation

define a DRE with a bad regression model
doubly_robust_bad_reg <- function(df, X, D, Y){
  ps <- # propensity score
    as.formula(paste(D, " ~ ", paste(X, collapse= "+"))) |>
    stats::glm( data = df, family = binomial() ) |>
    broom::augment(type.predict = "response", data = df) |>
    dplyr::pull(.fitted)
  
  mu0 <- rnorm(dim(df)[1], 0, 1) # wrong mean response D == 0
  mu1 <- rnorm(dim(df)[1], 0, 1) # wrong mean response D == 1
  
  # convert treatment factor to integer | recast as vectors
  d <- df[,D] |> dplyr::pull(1) |> as.character() |> as.numeric()
  y <- df[,Y] |> dplyr::pull(1)
  
  mean( d*(y - mu1)/ps + mu1 ) -
    mean(( 1-d)*(y - mu0)/(1-ps) + mu0 )
}

Doubly Robust Estimation

Code
all_results_dat <- all_results_dat |>
  dplyr::mutate(
    dre_bad_reg_estimate = 
      purrr::map_dbl(
        splits
        , (\(x){
          # rsample::analysis(x) |> 
          #   # dplyr::mutate(net = as.numeric(net)) |> # not needed
          #   recipes::bake(new_data = rsample::analysis(x)) |> 
          doubly_robust_rec |> 
            recipes::bake(new_data = rsample::analysis(x)) |> 
            doubly_robust_bad_reg(X, D, Y)
        })
      )
  )

dat <- all_results_dat |> 
  dplyr::select(
    reg_estimate, ipw_estimate, dre_estimate, dre_bad_ipw_estimate, dre_bad_reg_estimate
  ) |> 
  tidyr::pivot_longer(cols=everything(), names_to = "method", values_to = "effect estimate")

dat |>
  dplyr::mutate(method=factor(method)) |>
  ggplot(
    aes(
      `effect estimate`, after_stat(density), colour = method, fill = method
    )
  ) +
  geom_histogram(bins = 50, alpha = 0.2, position = 'identity') +
  geom_density(alpha = 0.2) +
  theme(legend.text=element_text(size=10), legend.title = element_blank())

Doubly Robust Estimation

Once again, we can use bootstrap and see that the variance is just slightly higher.

compare the results so far
dat |> 
  dplyr::group_by(method) |> 
  dplyr::summarise(
  mean = mean(`effect estimate`)
  , lower_ci = quantile(`effect estimate`, 0.025)
  , upper_ci = quantile(`effect estimate`, 0.975)
) |> gt::gt() |> gtExtras::gt_theme_espn()
method mean lower_ci upper_ci
dre_bad_ipw_estimate -12.93 -13.71 -12.16
dre_bad_reg_estimate -12.75 -13.56 -11.95
dre_estimate -12.91 -13.62 -12.28
ipw_estimate -12.56 -13.42 -11.66
reg_estimate -12.39 -12.98 -11.83

Doubly Robust Estimation

Once more, messing up the conditional mean model alone yields only a slightly different ATE. The magic of doubly robust estimation happens because, in causal inference, there are two ways to remove bias from our causal estimates: you either model the treatment mechanism or the outcome mechanism. If either of these models is correct, you are good to go.

One caveat is that, in practice, it’s very hard to model either of those precisely. More often, what ends up happening is that neither the propensity score nor the outcome model is 100% correct. They are both wrong, but in different ways. When this happens, it is not exactly settled [1] [2] [3] whether it’s better to use a single model or doubly robust estimation. At the very least, doubly robust estimation gives you two chances of being correct.

Finite Sample Bias

Finite Sample Bias

We know that failing to account for confounders or to block open backdoor paths can bias our causal estimates, but it turns out that even after accounting for all confounders we may still get a biased estimate in finite samples. Many of the properties we tout in statistics rely on large samples, and how “large” is defined can be opaque. Let’s look at a quick simulation. Here, we have an exposure/treatment, \(X\), an outcome, \(Y\), and one confounder, \(Z\). We will simulate \(Y\), which depends only on \(Z\) (so the true treatment effect is 0), and \(X\), which also depends on \(Z\).

\[ \begin{align*} Z &\sim \mathscr{N}(0,1)\\ X & = \mathrm{ifelse}(0.5+Z+\epsilon_X>0,\,1,\,0), \quad \epsilon_X \sim \mathscr{N}(0,1)\\ Y & = Z + \epsilon_Y, \quad \epsilon_Y \sim \mathscr{N}(0,1) \end{align*} \]

Finite Sample Bias

Sampling code

set.seed(928)
n <- 100
finite_sample <- tibble::tibble(
  # z is normally distributed with a mean: 0 and sd: 1
  z = rnorm(n),
  # x is defined from a probit selection model with normally distributed errors
  x = dplyr::case_when(
    0.5 + z + rnorm(n) > 0 ~ 1,
    TRUE ~ 0
  ),
  # y is continuous, dependent only on z with normally distributed errors
  y = z + rnorm(n)
)

Finite Sample Bias

If we fit a propensity score model using the one confounder \(Z\) and calculate the weighted estimator, we should get an unbiased result (which in this case would be \(0\)).

fit the propensity score model with finite samples
## fit the propensity score model
finite_sample_wts <- glm(
  x ~ z,
  data = finite_sample,
  family = binomial("probit")
) |>
  broom::augment(newdata = finite_sample, type.predict = "response") |>
  dplyr::mutate(wts = propensity::wt_ate(.fitted, x))

finite_sample_wts |>
  dplyr::summarize(
    effect = sum(y * x * wts) / sum(x * wts) -
      sum(y * (1 - x) * wts) / sum((1 - x) * wts)
  )
# A tibble: 1 × 1
  effect
   <dbl>
1  0.197

Finite Sample Bias

Our effect estimate of about 0.2 is pretty far from 0, although it’s hard to know whether this is really bias or something we are just seeing by chance in this particular simulated sample.

To explore the potential for finite sample bias, we can rerun this simulation many times at different sample sizes:

fit the propensity score model with different number of finite samples
sim <- function(n) {
  ## create a simulated dataset
  finite_sample <- tibble::tibble(
    z = rnorm(n),
    x = dplyr::case_when(
      0.5 + z + rnorm(n) > 0 ~ 1,
      TRUE ~ 0
    ),
    y = z + rnorm(n)
  )
  finite_sample_wts <- glm(
    x ~ z,
    data = finite_sample,
    family = binomial("probit")
  ) |>
    broom::augment(newdata = finite_sample, type.predict = "response") |>
    dplyr::mutate(wts = propensity::wt_ate(.fitted, x))
  bias <- finite_sample_wts |>
    dplyr::summarize(
      effect = sum(y * x * wts) / sum(x * wts) -
        sum(y * (1 - x) * wts) / sum((1 - x) * wts)
    ) |>
    dplyr::pull()
  tibble::tibble(
    n = n,
    bias = bias
  )
}

## Examine 6 different sample sizes, simulating each 1000 times
set.seed(1)
finite_sample_sims <- purrr::map_df(
  rep(
    c(50, 100, 500, 1000, 5000, 10000),
    each = 1000
  ),
  sim
)

bias <- finite_sample_sims |>
  dplyr::group_by(n) |>
  dplyr::summarise(bias = mean(bias))

Finite Sample Bias

Plotting the results:

Finite sample bias present with ATE weights created using correctly specified propensity score model, varying the sample size from n = 50 to n = 10,000

Finite Sample Bias

This demonstrates finite sample bias. Notice that even when the sample size is quite large (5,000) we still see some bias away from the “true” effect of 0. It isn’t until the sample size reaches 10,000 or more that we see this bias disappear.

Estimands that use unbounded weights (i.e. weights that can, in theory, be arbitrarily large) are more likely to suffer from finite sample bias. The likelihood of running into finite sample bias depends on:

  1. the estimand you have chosen (i.e. are the weights bounded? see the sketch after this list)
  2. the distribution of the covariates in the exposed and unexposed groups (i.e. is there good overlap? regions of poor overlap, where positivity is nearly violated, are where the weights can become very large)
  3. the sample size.
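
As an illustration of point 1, overlap (ATO) weights are bounded between 0 and 1 and so are far less prone to this problem. A minimal manual sketch on the simulated data above (note that the estimand changes from the ATE to the effect in the overlap population; this is not part of the lecture code):

bounded (overlap) weights on the simulated data
finite_sample_ato <- glm(
  x ~ z,
  data = finite_sample,
  family = binomial("probit")
) |>
  broom::augment(newdata = finite_sample, type.predict = "response") |>
  # overlap weights: 1 - p for the treated, p for the untreated (always in [0, 1])
  dplyr::mutate(wts = dplyr::if_else(x == 1, 1 - .fitted, .fitted))

finite_sample_ato |>
  dplyr::summarize(
    effect = sum(y * x * wts) / sum(x * wts) -
      sum(y * (1 - x) * wts) / sum((1 - x) * wts)
  )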

Matching

Matching

Regression is good at controlling for additional variables when we do a test vs control comparison. If we have independence, \((Y^0, Y^1)\perp D | X\), then regression can identify the ATE by controlling for \(X\).

To get some intuition about controlling for \(X\), let’s remember the case when all variables \(X\) are dummy variables.

If that is the case, regression partitions the data into the dummy cells and computes the mean difference between test and control. Effectively, we are computing

\[ E[Y|D=1, X=x] - E[Y|D=0, X=x] \]

where \(x\) is a dummy cell (all dummies set to 1, for example).

Regression then combines the estimates from each of the cells to produce a final ATE. The way it does this is by weighting each cell proportionally to the variance of the treatment in that group.

Matching

To give an example, suppose we want to estimate the effect of a drug and we have data for 6 men and 4 women. The response variable is days hospitalised, which we hope the drug can lower. For men, the true causal effect is -3, so the drug lowers the hospital stay by 3 days. For women, it is -2.

To make matters more interesting, men are much more affected by this illness and stay longer in the hospital. They also get much more of the drug: only 1 of the 6 men does not get the drug. Women, on the other hand, are more resistant to this illness, so they spend less time in the hospital, and only 50% of the women get the drug.

sex drug days
M 1 5
M 1 5
M 1 5
M 1 5
M 1 5
M 0 8
W 1 2
W 0 4
W 1 2
W 0 4

Matching

Note that a simple comparison of treatment and control yields a biased effect estimate: the drug seems less effective than it truly is. This is expected, since we’ve omitted the sex confounder.

In this case, the estimated ATE is smaller in magnitude than the true one because men get more of the drug and are more affected by the illness.

drug mean_effect ATE
0 5.333 NA
1 4.143 -1.19
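
A minimal sketch reconstructing the 10-observation example above (the tibble drug_dat and its name are ours) and reproducing the naive comparison:

toy drug data and naive comparison
drug_dat <- tibble::tibble(
  sex  = c(rep("M", 6), rep("W", 4)),
  drug = c(1, 1, 1, 1, 1, 0, 1, 0, 1, 0),
  days = c(5, 5, 5, 5, 5, 8, 2, 4, 2, 4)
)

# naive comparison of means, ignoring the sex confounder
drug_dat |>
  dplyr::group_by(drug) |>
  dplyr::summarise(mean_effect = mean(days)) |>
  dplyr::mutate(ATE = mean_effect - dplyr::lag(mean_effect))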

Matching

Since the true effect for men is -3 and the true effect for women is -2, the ATE should be

\[ ATE=\dfrac{(-3*6) + (-2*4)}{10}=-2.6 \]

This estimate is obtained by

  1. partitioning the data into confounder cells, in this case, men and women,
  2. estimating the effect on each cell and
  3. combining the estimate with a weighted average, where the weight is the sample size of the cell or covariate group.

Matching

If we had exactly the same number of men and women in the data, the ATE estimate would be right in the middle of the ATE of the 2 groups, -2.5. Since there are more men than women in our dataset, the ATE estimate is a little bit closer to the men’s ATE.

This is called a non-parametric estimate, since it places no assumption on how the data was generated.

If we control for sex using regression, we will add the assumption of linearity. Regression will also partition the data into men and women and estimate the effect on both of these groups.

So far, so good. However, when it comes to combining the effects from each group, it does not weight them by the sample size.

Matching

Instead, regression uses weights that are proportional to the variance of the treatment in that group. In our case, the variance of the treatment in men is smaller than in women, since only one man is in the control group.

To be exact, the variance of D for men is \(0.139=1/6*(1 - 1/6)\) and for women is \(0.25=2/4*(1 - 2/4)\).

So regression will give a higher weight to women in our example and the ATE will be a bit closer to the women’s ATE of -2.

Regression summary (ATE)
days ~ drug + sex
term estimate std.error statistic p.value
(Intercept) 7.5455 0.188 40.093 0.000
drug −2.4545 0.188 −13.042 0.000
sexW −3.3182 0.176 −18.849 0.000
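
A sketch reproducing the regression summary above, reusing the drug_dat tibble from the earlier sketch; the drug coefficient is the variance-weighted estimate of about -2.45:

regression adjustment on the toy drug data
lm(days ~ drug + sex, data = drug_dat) |>
  broom::tidy()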

Matching

The result is more intuitive with dummy variables, but regression also keeps continuous variables constant while estimating the effect.

Also, with continuous variables, the ATE will point in the direction where covariates have more variance.

So we’ve seen that regression has its idiosyncrasies. It is linear, parametric, likes high variance features… This can be good or bad, depending on the context.

Because of this, it’s important to be aware of other techniques we can use to control for confounders. Not only are they an extra tool in your causal tool belt, but understanding different ways to deal with confounding expands our understanding of the problem.

For this reason, we’ll now examine the Subclassification Estimator!

Matching: subclassification

In general, if there is some causal effect we want to estimate, but it is hard to do so because of confounding by some variables X, what we need to do is make the treatment vs control comparison within small groups where X is the same.

If we have conditional independence \((Y^0, Y^1)\perp D | X\) , then we can write the ATE as follows.

\[ ATE = \int(E[Y|X=x, D=1] - E[Y|X=x, D=0])dP(x) \]

The integral ranges over the whole distribution of the features X, computes the difference in means within each of those tiny regions, and combines everything into the ATE.

Matching: subclassification

Another way to see this is to think about a discrete set of features.

In this case, we can say that the values the feature set X takes on fall into K different cells \(\{X_1, X_2, ..., X_K\}\), and what we are doing is computing the treatment effect in each cell and combining them into the ATE.

In this discrete case, converting the integral to a sum, we can derive the subclassification estimator:

\[ \hat{ATE} = \sum^K_{k=1}(\bar{Y}^1_k - \bar{Y}^0_k) * \dfrac{N_k}{N} \]

Matching: subclassification

\[ \hat{ATE} = \sum^K_{k=1}(\bar{Y}^1_k - \bar{Y}^0_k) * \dfrac{N_k}{N} \]

As you can see, we are computing a local ATE for each cell and combining them using a weighted average, where the weights are the sample size of the cell. In our medicine example above, this would be the first estimate, which gave us −2.6.
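
A sketch of the subclassification estimator on the toy drug data (again reusing drug_dat from the earlier sketch), which reproduces the -2.6 above:

subclassification estimator on the toy drug data
drug_dat |>
  dplyr::group_by(sex) |>
  dplyr::summarise(
    effect = mean(days[drug == 1]) - mean(days[drug == 0]),  # per-cell effect
    n_k    = dplyr::n()                                      # cell size
  ) |>
  dplyr::summarise(ATE = sum(effect * n_k / sum(n_k)))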

The subclassification estimator isn’t used much in practice, because of the curse of dimensionality.

Matching estimator

The subclassification estimator gives us a nice intuition of what a causal inference estimator should do, how it should control for confounders.

This allows us to explore other kinds of estimators, such as the Matching Estimator.

Matching estimator

When some confounder X makes it so that treated and untreated are not initially comparable, we can make them so by matching each treated unit with a similar untreated unit - finding an untreated twin for every treated unit. In making such comparisons, treated and untreated become again comparable.

As an example, let’s suppose we are trying to estimate the effect of a trainee program on earnings. Here is what the trainee data looks like:

Code
data_trainees <- readr::read_csv("data/trainees.csv", show_col_types = FALSE)

data_trainees |> 
  dplyr::filter(trainees==1) |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn()
unit trainees age earnings
1 1 28 17700
2 1 34 10200
3 1 29 14400
4 1 25 20800
5 1 29 6100
Code
data_trainees |> 
  dplyr::filter(trainees==0) |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn()
unit trainees age earnings
20 0 43 20900
21 0 50 31000
22 0 30 21000
23 0 27 9300
24 0 54 41100

Matching estimator

A simple comparison of means shows that the trainees earn less money than those who didn’t go through the program.

Code
data_trainees |> 
  dplyr::group_by(trainees) |> 
  dplyr::summarize(mean_effect = mean(earnings)) |> 
  dplyr::mutate(ATE = mean_effect - dplyr::lag( mean_effect) ) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn()
trainees mean_effect ATE
0 20724 NA
1 16426 -4297

However, if we look at the data tables, we notice that trainees are much younger than non-trainees, which indicates that age is probably a confounder.

Matching estimator

We can use matching on age to try to correct that.

We will take unit 1 from the treated and pair it with unit 27, since both are 28 years old. We pair unit 2 with unit 34, unit 3 with unit 37, unit 4 with unit 35… When it comes to unit 5, we need to find someone aged 29 among the non-treated, but that is unit 37, which is already paired.

This is not a problem, since we can use the same unit multiple times. If more than 1 unit is a match, we can choose randomly between them.

Code
unique_on_age <- data_trainees |> 
  dplyr::filter(trainees == 0) |> 
  dplyr::distinct(age, .keep_all = TRUE)

matches <- data_trainees |> 
  dplyr::filter(trainees == 1) |> 
  dplyr::left_join(unique_on_age, by = 'age', suffix = c("_t_1", "_t_0")) |> 
  dplyr::mutate(t1_minus_t0 = earnings_t_1 - earnings_t_0) 

matches |> dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn()
unit_t_1 trainees_t_1 age earnings_t_1 unit_t_0 trainees_t_0 earnings_t_0 t1_minus_t0
1 1 28 17700 27 0 8800 8900
2 1 34 10200 34 0 24200 -14000
3 1 29 14400 37 0 6200 8200
4 1 25 20800 35 0 23300 -2500
5 1 29 6100 37 0 6200 -100

Matching estimator

If we take the mean of this last column we get the ATET estimate while controlling for age. Notice how the estimate is now very positive, compared to the previous one where we used a simple difference in means.

Code
matches |> dplyr::summarize(ATE = mean(t1_minus_t0)) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn()
ATE
2458

This is a contrived example, just to introduce matching.

Matching estimator

In practice, we usually have more than one feature and units don’t match perfectly. In this case, we have to define some measure of proximity to assess how close units are to each other.

One common metric for this is the Euclidean norm \(||X_i - X_j||\). This distance, however, is not invariant to the scale of the features.

This means that features like age, which take values in the tens, will contribute much less to the norm than features like income, which take values in the hundreds or thousands.

For this reason, before applying the norm, we need to scale the features so that they are on roughly the same scale.

Matching estimator

Having defined a distance measure, we can now define the match as the nearest neighbour to that sample we wish to match.

We can write the matching estimator the following way:

\[ \hat{ATE} = \frac{1}{N} \sum^N_{i=1} (2D_i - 1)\big(Y_i - Y_{jm}(i)\big) \]

where \(Y_{jm}(i)\) is the outcome of the unit from the other treatment group that is most similar (in covariates) to unit \(i\).

The factor \(2D_i - 1\) (equal to +1 for treated and -1 for control units) lets us match both ways: treated with controls and controls with treated.

Matching estimator

To test this estimator, let’s consider a medicine example.

Once again, we want to find the effect of a medication on days until recovery.

Unfortunately, this effect is confounded by severity, sex and age. We have reasons to believe that patients with more severe conditions have a higher chance of receiving the medicine.

Code
data_med <- readr::read_csv("data/medicine_impact_recovery.csv", show_col_types = FALSE)

data_med |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
sex age severity medication recovery
0 35.05 0.8877 1 31
1 41.58 0.8998 1 49
1 28.13 0.4863 0 38
1 36.38 0.3231 0 35
0 25.09 0.2090 0 15

Matching estimator

If we look at a simple difference in means, \(E[Y|D=1]-E[Y|D=0]\), we get that the treated take, on average, 16.9 more days to recover than the untreated.

This is probably due to confounding, since we don’t expect the medicine to cause harm to the patient.

Code
data_med |> 
  dplyr::group_by(medication) |> 
  dplyr::summarize(mean_effect = mean(recovery)) |> 
  dplyr::mutate(ATE = mean_effect - dplyr::lag( mean_effect) ) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
medication mean_effect ATE
0 21.68 NA
1 38.57 16.9

Matching estimator

To correct for this bias, we will control for X using matching.

First, we need to remember to scale our features, otherwise, features like age will have higher importance than features like severity when we compute the distance between points.

To do so, we can standardise the features.

Code
data_med <- data_med |> recipes::recipe(recovery ~ .) |> 
  recipes::update_role(medication, new_role = 'treatment') |> 
  recipes::step_normalize(recipes::all_predictors()) |> 
  recipes::prep() |> 
  recipes::bake(new_data=NULL)

data_med |>   
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
sex age severity medication recovery
-0.997 0.2808 1.4598 1 31
1.003 0.8654 1.5022 1 49
1.003 -0.3387 0.0578 0 38
1.003 0.3995 -0.5126 0 35
-0.997 -0.6105 -0.9111 0 15

Matching estimator

Instead of coding a matching function, we will use the K nearest neighbour algorithm from caret::knnreg.

This algorithm makes predictions by finding the nearest data point in an estimation or training set.

For matching, we will need 2 of those. One, \(mt_0\), will store the untreated points and will find matches in the untreated when asked to do so.

The other, \(mt_1\), will store the treated points and will find matches in the treated when asked to do so. After this fitting step, we can use these KNN models to make predictions, which will be our matches.

Matching estimator

Code
treated   <- data_med |> dplyr::filter(medication==1)
untreated <- data_med |> dplyr::filter(medication==0)

mt0 <- # untreated knn model predicting recovery
  caret::knnreg(x = untreated |> dplyr::select(sex,age,severity), y = untreated$recovery, k=1)
mt1 <- # treated knn model predicting recovery
  caret::knnreg(x = treated |> dplyr::select(sex,age,severity), y = treated$recovery, k=1)

predicted <-
  # combine the treated and untreated matches
  c(
    # find matches for the treated looking at the untreated knn model
    treated |>
      tibble::rowid_to_column("ID") |> 
      {\(y)split(y,y$ID)}() |> # hack for native pipe 
      # split(.$ID) |>         # this vesion works with magrittr
      purrr::map(
        (\(x){
          x |> 
            dplyr::mutate(
              match = predict( mt0, x[1,c('sex','age','severity')] )
            )
        })
      )
    # find matches for the untreated looking at the treated knn model
    , untreated |>
        tibble::rowid_to_column("ID") |> 
        {\(y)split(y,y$ID)}() |> 
        # split(.$ID) |> 
        purrr::map(
          (\(x){
            x |> 
              dplyr::mutate(
                match = predict( mt1, x[1,c('sex','age','severity')] )
              )
          })
        )
  ) |>
  # bind the treated and untreated data 
  dplyr::bind_rows()

predicted |>   
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::fmt_number(columns = c('sex','age','severity'), decimals = 6) |>
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
ID sex age severity medication recovery match
1 −0.996980 0.280787 1.459800 1 31 39
2 1.002979 0.865375 1.502164 1 49 52
3 −0.996980 1.495134 1.268540 1 38 46
4 1.002979 −0.106534 0.545911 1 34 45
5 −0.996980 0.043034 1.428732 1 30 39

Matching estimator

With the matches, we can now apply the matching estimator formula

\[ \hat{ATE} = \frac{1}{N} \sum^N_{i=1} (2D_i - 1)\big(Y_i - Y_{jm}(i)\big) \]

Code
predicted |> 
  dplyr::summarize(
    "ATE (est)" = mean( (2*medication - 1) * (recovery - match) )
  ) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn() |> 
  gt::as_raw_html()
ATE (est)
-0.9954

Using this sort of matching, we can see that the effect of the medicine is not positive anymore.

This means that, controlling for \(X\), the medicine reduces the recovery time by about 1 day, on average.

This is already a huge improvement on top of the biased estimate that predicted a 16.9-day increase in recovery time.

Matching bias

It turns out the matching estimator as we’ve designed above is biased.

To see this, let’s consider the ATET estimator, instead of the ATE, just because it is simpler to write.

The intuition will apply to the ATE as well.

\[ \hat{ATET} = \frac{1}{N_1}\sum (Y_i - Y_j(i)) \]

where \(N_1\) is the number of treated individuals and \(Y_j(i)\) is the untreated match of treated unit i.

Matching bias

To check for bias, we would like the Central Limit Theorem to apply, so that the quantity below converges to a normal distribution with mean zero.

\[ \sqrt{N_1}(\hat{ATET} - ATET) \]

However, this doesn’t always happen. If we define the mean outcome for the untreated given X as \(\mu_0(x)=E[Y|X=x, D=0]\), we will have that

\[ \mathbb{E}\left[\sqrt{N_1}(\hat{ATET} - ATET)\right] = \mathbb{E}\left[\sqrt{N_1}\big(\mu_0(X_i) - \mu_0(X_{j(i)})\big)\right] \]

Matching bias

Now, \(\mu_0(X_i) - \mu_0(X_j(i))\) is not so simple to understand, so let’s look at it more carefully.

  • \(\mu_0(X_i)\) is the outcome Y value of a treated unit \(i\) had it not been treated. It is the counterfactual outcome \(Y^0\) for unit \(i\).
  • \(\mu_0(X_j(i))\) is the outcome of the untreated unit \(j\) that is the match of unit \(i\).
  • It is also the \(Y^0\), but for unit \(j\) now.
  • Only this time, it is a factual outcome, because \(j\) is in the non treated group.
  • Now, because \(j\) and \(i\) are only similar, not identical, this difference will likely not be exactly zero. In other words, \(X_i \approx X_{j(i)}\) only gives us \(Y^0_i \approx Y^0_{j(i)}\), not equality.

Matching bias

As we increase the sample size, there will be more units to match, so the difference between unit \(i\) and its match \(j\) will also get smaller.

But this difference converges to zero slowly.

As a result, \(E[\sqrt{N_1}(\mu_0(X_i) - \mu_0(X_{j(i)}))]\) may not converge to zero, because \(\sqrt{N_1}\) grows faster than \(\mu_0(X_i) - \mu_0(X_{j(i)})\) shrinks.

Matching bias

Bias arises when the matching discrepancies are large. Fortunately, we know how to correct it. Each observation contributes \(\mu_0(X_i) - \mu_0(X_{j(i)})\) to the bias, so all we need to do is subtract this quantity from each matching comparison in our estimator.

To do so, we can replace \(\mu_0(X_j(i))\) with some sort of estimate of this quantity \(\hat{\mu}_0(X_j(i))\), which can be obtained with models like linear regression. This updates the ATET estimator to the following equation:

\[ \hat{ATET} = \frac{1}{N_1}\sum \big((Y_i - Y_{j(i)}) - (\hat{\mu_0}(X_i) - \hat{\mu_0}(X_{j(i)}))\big) \]

where \(\hat{\mu_0}(x)\) is some estimate of \(E[Y|X, D=0]\), like a linear regression fitted only on the untreated sample.

Matching bias

Code
ols0 <- lm(recovery ~ sex + age + severity, data = untreated) 
ols1 <- lm(recovery ~ sex + age + severity, data = treated) 

# find the units that match to the treated
treated_match_index <- # RANN::nn2 does Nearest Neighbour Search
  (RANN::nn2(mt0$learn$X, treated |> dplyr::select(sex,age,severity), k=1))$nn.idx |> 
  as.vector()

# find the units that match to the untreated
untreated_match_index <- # RANN::nn2 does Nearest Neighbour Search
  (RANN::nn2(mt1$learn$X, untreated |> dplyr::select(sex,age,severity), k=1))$nn.idx |> 
  as.vector()

predicted <- 
c(
  purrr::map2(
    .x = 
      treated |> tibble::rowid_to_column("ID") |> {\(y)split(y,y$ID)}() # split(.$ID) 
    , .y = treated_match_index
    , .f = (\(x,y){
          x |> 
            dplyr::mutate(
              match = predict( mt0, x[1,c('sex','age','severity')] )
              , bias_correct =
                  predict( ols0, x[1,c('sex','age','severity')] ) -
                  predict( ols0, untreated[y,c('sex','age','severity')] )
            )
        })
  )
  , purrr::map2(
    .x = 
      untreated |> tibble::rowid_to_column("ID") |> {\(y)split(y,y$ID)}() # split(.$ID) 
    , .y = untreated_match_index
    , .f = (\(x,y){
          x |> 
            dplyr::mutate(
              match = predict( mt1, x[1,c('sex','age','severity')] )
              , bias_correct =
                  predict( ols1, x[1,c('sex','age','severity')] ) -
                  predict( ols1, treated[y,c('sex','age','severity')] )
            )
        })
  )
) |> 
  # bind the treated and untreated data 
  dplyr::bind_rows()

predicted |>   
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::fmt_number(columns = c('sex','age','severity'), decimals = 6) |>
  gtExtras::gt_theme_espn()
ID sex age severity medication recovery match bias_correct
1 −0.996980 0.280787 1.459800 1 31 39 4.404
2 1.002979 0.865375 1.502164 1 49 52 12.915
3 −0.996980 1.495134 1.268540 1 38 46 1.871
4 1.002979 −0.106534 0.545911 1 34 45 -0.497
5 −0.996980 0.043034 1.428732 1 30 39 2.610

Matching bias

Doesn’t this defeat the point of matching? If we have to run a linear regression anyway, why not use only that, instead of this more complicated procedure?

First of all, this linear regression that we are fitting doesn’t extrapolate on the treatment dimension to get the treatment effect. Instead, its purpose is just to correct bias.

Linear regression here is local, in the sense that it doesn’t try to see how the treated would be if it looked like the untreated. It does none of that extrapolation. This is left to the matching part.

The meat of the estimator is still the matching component. The point is that OLS is secondary to this estimator.

Matching bias

The second point is that matching is a non-parametric estimator. It doesn’t assume linearity or any kind of parametric model.

As such, it is more flexible than linear regression and can work in situations where linear regression will not, namely, those where nonlinearity is very strong.

Matching bias

With the bias correction formula, we get the following ATE estimate.

predicted |> 
  dplyr::summarize(
    "ATE (est)" = 
      mean( (2*medication - 1) * (recovery - match - bias_correct) )) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn()
ATE (est)
-7.363

Matching CI

Of course, we also need to place a confidence interval around this measurement.

In practice, we can simply use someone else’s code and import a matching estimator. Here is one from the Matching package.

Code
require(Matching)
#  See https://www.jsekhon.com for additional documentation.
c("ATE", "ATC", "ATT") |> 
  purrr::map(
    (\(x){
      pairmatching <- Matching::Match(
        Y = data_med$recovery
        , Tr = data_med$medication
        , X = data_med |> dplyr::select('sex','age','severity')
        , estimand = x
        , BiasAdjust = TRUE
      )
      tibble::tibble(measure = x, est = pairmatching$est[1,1], se = pairmatching$se)
    })
  ) |> 
  dplyr::bind_rows() |> 
  gt::gt() |> 
  gt::fmt_number(columns = c('est','se'), decimals = 3) |>
  gt::tab_header(
    title = "Treatment Effect Estimates: Matching"
    , subtitle = "using R package Matching"
  ) |> 
  gtExtras::gt_theme_espn()
Treatment Effect Estimates: Matching
using R package Matching
measure est se
ATE −7.708 0.962
ATC −6.664 1.644
ATT −9.680 0.123

Finally, we can say with confidence that our medicine does indeed lower the time someone spends at the hospital. The ATE estimate is just a little lower than the one from our own algorithm, due to differences in how matching ties are broken between our knn implementation and the Matching package.

Matching bias again

We saw that matching is biased when the unit and its match are not so similar. But what causes them to be so different?

As it turns out, the answer is quite simple and intuitive. It is easy to find people that match on a few characteristics, like sex. But if we add more characteristics, like age, income, city of birth and so on, it becomes harder and harder to find matches. In more general terms, the more features we have, the larger the distance between units and their matches will be.

This is not something that hurts only the matching estimator. It ties back to the subclassification estimator we saw earlier. In that contrived drug example with men and women, it was quite easy to build the subclassification estimator, because we only had 2 cells: men and women. But what would happen if we had more?

Let’s say we have 2 continuous features, like age and income, and we manage to discretise them into 5 buckets each. This gives us 25 cells, or \(5^2\). And what if we had 10 covariates with 3 buckets each? That doesn’t seem like a lot, right? Well, it would give us 59,049 cells, or \(3^{10}\). It’s easy to see how this can blow up pretty quickly. This phenomenon is pervasive in all of data science and is called the Curse of Dimensionality!

Matching bias again

In the context of the subclassification estimator, the curse of dimensionality means that it will suffer if we have lots of features.

Lots of features imply many cells in X. If there are many cells, some of them will have very few observations. Some might even contain only treated or only control units, so it won’t be possible to estimate the ATE there, which would break our estimator.

In the matching context, this means that the feature space will be very sparse and units will be very far from each other. This will increase the distance between matches and cause bias problems.

Matching bias again

As for linear regression, it actually handles this problem quite well.

What it does is project all the features X into a single one, the Y dimension. It then makes treatment and control comparison on that projection.

So, in some way, linear regression performs some sort of dimensionality reduction to estimate the ATE. It’s quite elegant.

Difference-in-Differences

Difference-in-Differences

Three Billboards in the South of Brazil

To figure out how good billboards were as a marketing channel, a bank placed 3 billboards in the city of Porto Alegre, the capital of the state of Rio Grande do Sul.

They wanted to see if that boosted deposits into its savings accounts.

Rio Grande do Sul is part of the south of Brazil, one of the most developed regions.

Data was also obtained from another southern capital, Florianopolis, the capital of the state of Santa Catarina in Brazil. The idea is that Florianopolis could be used as a control sample to estimate the counterfactual for Porto Alegre. The billboards were placed in Porto Alegre for the entire month of June. The resulting data looks like this:

Code
data_bboard <- readr::read_csv("data/billboard_impact.csv", show_col_types = FALSE)

data_bboard |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gtExtras::gt_theme_espn()
deposits poa jul
42 1 0
0 1 0
52 1 0
119 1 0
21 1 0

Difference-in-Differences

Three Billboards in the South of Brazil

In this example, deposits is our outcome variable, the one the bank wishes to increase with the billboards. poa is a dummy variable for the city of Porto Alegre; when it is zero, the samples are from Florianopolis. jul is a dummy for the month of July, i.e. the post-intervention period; when it is zero, the samples are from May, the pre-intervention period.

Difference-in-Differences

DID Estimator

Let \(Y^D(T)\) be the potential outcome for treatment D on period T. In an ideal world where we have the ability to observe the counterfactual, we would estimate the treatment effect of an intervention the following way:

\[ \hat{ATET} = E[Y^1(1) - Y^0(1)|D=1] \]

In words, the causal effect is the outcome in the post-intervention period under treatment minus the outcome in the same post-intervention period under no treatment. Of course, we can’t measure this because \(Y^0(1)\) is counterfactual.

Difference-in-Differences

DID Estimator

One way around this is a before and after comparison.

\[ \hat{ATET} = E[Y(1)|D=1] - E[Y(0)|D=1] \]

In our example, we would compare the average deposits from POA before and after the billboard was placed.

Code
poa_before <- data_bboard |> dplyr::filter(poa==1 & jul==0) |> dplyr::pull(deposits) |> mean()
poa_after  <- data_bboard |> dplyr::filter(poa==1 & jul==1) |> dplyr::pull(deposits) |> mean()
poa_after - poa_before
[1] 41.05

This estimator is telling us that we should expect deposits to increase by R$ 41.05 after the intervention. But can we trust this?

Difference-in-Differences

DID Estimator

Notice that \(E[Y(0)|D=1]=E[Y^0(0)|D=1]\), the observed outcome for the treated unit before the intervention is equal to the counterfactual outcome for the treated unit also before the intervention. Since we are using this to estimate \(E[Y^0(1)|D=1]\), the counterfactual after the intervention, this estimation above assumes that \(E[Y^0(1)|D=1] = E[Y^0(0)|D=1]\).

It is saying that in the case of no intervention, the outcome in the latter period would be the same as the outcome from the starting period. This would obviously be false if your outcome variable follows any kind of trend.

Difference-in-Differences

DID Estimator

For example, if deposits are going up in POA, \(E[Y^0(1)|D=1] > E[Y^0(0)|D=1]\), i.e. the outcome of the latter period would be greater than that of the starting period even in the absence of the intervention.

With a similar argument, if the trend in Y is going down, \(E[Y^0(1)|D=1] < E[Y^0(0)|D=1]\). This is to show that this before and after thing is not a great estimator.

Difference-in-Differences

DID Estimator

Another idea is to compare the treated group with an untreated group that didn’t get the intervention:

\[ \hat{ATET} = E[Y(1)|D=1] - E[Y(1)|D=0] \]

In our example, it would be to compare the deposits from POA to that of Florianopolis in the post intervention period.

Code
fl_after  <- data_bboard |> dplyr::filter(poa==0 & jul==1) |> dplyr::pull(deposits) |> mean()
poa_after - fl_after
[1] -119.1

This estimator is telling us that the campaign is detrimental and that customers will decrease deposits by R\(\$ 119.10\).

Difference-in-Differences

DID Estimator


Notice that \(E[Y(1)|D=0]=E[Y^0(1)|D=0]\). And since we are using \(E[Y(1)|D=0]\) to estimate the counterfactual for the treated after the intervention, we are assuming we can replace the missing counterfactual like this: \(E[Y^0(1)|D=0] = E[Y^0(1)|D=1]\).

But notice that this would only be true if both groups have a very similar baseline level. For instance, if Florianopolis has way more deposits than Porto Alegre, this would not be true because \(E[Y^0(1)|D=0] > E[Y^0(1)|D=1]\).

On the other hand, if the level of deposits is lower in Florianopolis, we would have \(E[Y^0(1)|D=0] < E[Y^0(1)|D=1]\).


Difference-in-Differences

DID Estimator

Again, this is not a great idea. To solve this, we can combine the space and time comparisons. This is the idea of the difference-in-differences approach. It works by replacing the missing counterfactual the following way:

\[ \begin{align*} E[Y^0(1)|D=1] & = E[Y^0(0)|D=1] + \\ & (E[Y^0(1)|D=0] - E[Y^0(0)|D=0]) \end{align*} \]

What this does is take the treated unit before the intervention and adds a trend component to it, which is estimated using the control \(E[Y^0(1)|D=0] - E[Y^0(0)|D=0]\). In words, it is saying that the treated after the intervention, had it not been treated, would look like the treated before the treatment plus a growth factor that is the same as the growth of the control.
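As a quick numeric sketch of this imputation on our data (fl_before and fl_after are the Florianopolis group means; the same quantities also appear in a later chunk):

Code
# Florianopolis (control) means before and after the intervention
fl_before <- data_bboard |> dplyr::filter(poa==0 & jul==0) |> dplyr::pull(deposits) |> mean()
fl_after  <- data_bboard |> dplyr::filter(poa==0 & jul==1) |> dplyr::pull(deposits) |> mean()

# imputed counterfactual: POA before the intervention plus the control growth
poa_counterfactual <- poa_before + (fl_after - fl_before)
# poa_after - poa_counterfactual is then the DiD estimate of the effect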

Difference-in-Differences

DID Estimator

It is important to notice that this assumes that the trends in the treatment and control are the same:

\[ E[Y^0(1) − Y^0(0)|D=1] = E[Y^0(1) − Y^0(0)|D=0] \]

where the left hand side is the counterfactual trend. Now, we can replace the estimated counterfactual in the treatment effect definition

\(E[Y^1(1)|D=1] - E[Y^0(1)|D=1]\)

\[ \hat{ATET} = E[Y(1)|D=1] - \big(E[Y(0)|D=1] + (E[Y(1)|D=0] - E[Y(0)|D=0])\big) \]

The estimator gets its name because it takes the difference between the after-minus-before differences of the treated and control groups.

Difference-in-Differences

DID Estimator

Here is what that looks like in code.

Code
fl_before <- data_bboard |> dplyr::filter(poa==0 & jul==0) |> dplyr::pull(deposits) |> mean()
diff_in_diff = (poa_after-poa_before)-(fl_after-fl_before)
diff_in_diff
[1] 6.525

Diff-in-Diff is telling us that we should expect deposits to increase by R\(\$ 6.52\) per customer.

Notice that the assumption diff-in-diff makes is much more plausible than those of the other two estimators. It only assumes that the growth patterns of the two cities are the same; it requires neither the same baseline level nor a zero trend.

Difference-in-Differences

DID Estimator

To visualize what diff-in-diff is doing, we can project the growth trend from the untreated into the treated to see the counterfactual, that is, the number of deposits we should expect if there were no intervention.

The small difference between the solid and the dashed lines shows the small treatment effect on Porto Alegre.
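A sketch of one way to draw this picture with ggplot2, reusing the four group means computed above and the imputed counterfactual:

Code
library(ggplot2)

plot_df <- tibble::tibble(
  period   = factor(rep(c("May", "Jul"), 3), levels = c("May", "Jul")),
  series   = rep(c("FLN", "POA", "POA counterfactual"), each = 2),
  deposits = c(fl_before, fl_after,
               poa_before, poa_after,
               poa_before, poa_before + (fl_after - fl_before))
)

ggplot(plot_df, aes(period, deposits, group = series, linetype = series)) +
  geom_line() +
  geom_point()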

Difference-in-Differences

DID Estimator

But how much can we trust this estimator? To get standard errors, we use a neat trick: we cast diff-in-diff as a regression. Specifically, we will estimate the following linear model

\[ Y_i = \beta_0 + \beta_1 POA_i + \beta_2 Jul_i + \beta_3 POA_i*Jul_i + e_i \]

Notice that \(\beta_0\) is the baseline of the control. In our case, it is the level of deposits in Florianopolis in the month of May.

If we turn on the treated city dummy, we get \(\beta_1\). So \(\beta_0+\beta_1\) is the baseline of Porto Alegre in May, before the intervention, and \(\beta_1\) is the increase of Porto Alegre baseline on top of Florianopolis.

If we turn the POA dummy off and turn the July dummy on, we get \(\beta_0+\beta_2\), which is the level of Florianópolis in July, after the intervention period. \(\beta_2\) is then the trend of the control, since we add it on top of the baseline to get the level of the control at the period post intervention.

Difference-in-Differences

DID Estimator

As a recap, \(\beta_1\) is the increment we get by going from the control to the treated, \(\beta_2\) is the increment we get by going from the period before to the period after the intervention. Finally, if we turn both dummies on, we get \(\beta_3\). \(\beta_0+\beta_1+\beta_2+\beta_3\) is the level in Porto Alegre after the intervention. So \(\beta_3\) is the incremental impact when you go from May to July and from Florianopolis to POA. In other words, it is the Difference in Difference estimator.
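The regression output below can be reproduced with an interaction model; a sketch of the call:

Code
lm(deposits ~ poa * jul, data = data_bboard) |>
  broom::tidy()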

# A tibble: 4 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   172.        2.36     72.6  0        
2 poa          -126.        4.48    -28.0  1.39e-159
3 jul            34.5       3.04     11.4  1.43e- 29
4 poa:jul         6.52      5.73      1.14 2.55e-  1

Difference-in-Differences

One obvious potential problem with Diff-in-Diff is failure to satisfy the parallel trend assumption. If the growth trend from the treated is different from the trend of the control, diff-in-diff will be biased.

This is a common problem with non-random data, where the decision to treat a region is based on its potential to respond well to the treatment, or where the treatment is targeted at regions that are not performing very well. In our marketing example, we decided to test billboards in Porto Alegre not to measure the effect of billboards in general, but simply because sales were performing poorly there.

Perhaps online marketing is not working there. In this case, it could be that the growth we would see in Porto Alegre without a billboard would be lower than the growth we observe in other cities. This would cause us to underestimate the effect of the billboard there.

Difference-in-Differences

One way to check if this is happening is to plot the trend using past periods. For example, let’s suppose POA had a small decreasing trend but Florianopolis was on a steep ascent. In this case, showing periods from before would reveal those trends and we would know Diff-in-Diff is not a reliable estimator.

Difference-in-Differences

One final issue that is worth mentioning is that you won’t be able to place confidence intervals around your Diff-in-Diff estimator if you only have aggregated data.

Say, for instance, that you don’t have data on what each individual customer in Florianópolis or Porto Alegre did. Instead, you only have the average deposits before and after the intervention for both cities.

In this case, you will still be able to estimate the causal effect by Diff-in-Diff, but you won’t know the variance of it. That’s because all the variability in your data got squashed out in aggregation.

Panel Data and Fixed Effects

Panel Data and Fixed Effects

In the simple Diff-in-Diff setup, we have a treated and a control group (the city of POA and FLN, respectively) and only two periods, a pre-intervention and a post-intervention period. But what would happen if we had more periods? Or more groups?

This setup is so common and powerful for causal inference that it gets its own name: panel data.

A panel is when we have repeated observations of the same unit over multiple periods of time. This is incredibly common in the industry, where companies track user data over multiple weeks and months.

To understand how we can leverage such data structure, let’s first continue with our diff-in-diff example, where we wanted to estimate the impact of placing a billboard (treatment) in the city of Porto Alegre (POA). Specifically, we want to know how much deposits in our investments account would increase if we placed more billboards.

Panel Data and Fixed Effects

In the previous section, we motivated the DiD estimator as an imputation strategy of what would have happened to Porto Alegre had we not placed the billboards in it.

We said that the counterfactual outcome \(Y^0\) for Porto Alegre after the intervention (placing a billboard) could be imputed as the number of deposits in Porto Alegre before the intervention plus a growth factor.

This growth factor was estimated in a control city, Florianopolis (FLN).

Panel Data and Fixed Effects

To recap some notation, here is how we can estimate this counterfactual outcome

\[ \begin{align*} \begin{split} \underbrace{\widehat{Y^0(1)|D=1}}_{\substack{\text{POA outcome after intervention} \\ \text{if no intervention had taken place}}} = \underbrace{Y^0(0)|D=1}_{\substack{\text{POA outcome} \\ \text{before intervention}}} + \big( \underbrace{Y^0(1)|D=0}_{\substack{\text{FLN outcome after} \\ \text{intervention in POA}}} - \underbrace{Y^0(0)|D=0}_{\substack{\text{FLN outcome before} \\ \text{intervention in POA}}} \big) \end{split} \end{align*} \]

where \(t\) denotes time, \(D\) denotes the treatment (we use \(D\) because \(T\) is already taken for time), and \(Y^D(t)\) denotes the potential outcome for treatment \(D\) in period \(t\) (for example, \(Y^0(1)\) is the outcome under the control in period 1).

Panel Data and Fixed Effects

Now, if we take that imputed potential outcome, we can recover the treatment effect for POA (ATT) as follows

\[ \begin{split} \widehat{ATT} = \underbrace{Y^1(1)|D=1}_{\substack{\text{POA outcome} \\ \text{after intervention}}} - \widehat{Y^0(1)|D=1} \end{split} \]

The effect of placing a billboard in POA is the outcome we saw in POA after placing the billboard minus our estimate of what would have happened had we not placed it. Recall that the power of DiD comes from the fact that estimating this counterfactual only requires that the growth in deposits in POA matches the growth in deposits in FLN. This is the key parallel trends assumption.

Panel Data and Fixed Effects

One way to see the parallel (or common) trends assumption is as an independence assumption. Recall that the independence assumption requires that the treatment assignment is independent of the potential outcomes:

\[ Y^d \perp D \]

So we don’t give more treatment to units with higher outcome (causing upward bias in the effect estimation) or lower outcome (causing downward bias).

If the marketing manager decides to add billboards only to cities that already have very high deposits, the manager can later boast that cities with billboards generate more deposits, and that the marketing campaign was a success.

This violates the independence assumption: we are giving the treatment to cities with high \(Y^0\).

Panel Data and Fixed Effects

Recall that a natural extension of this assumption is the conditional independence assumption, which allows the potential outcomes to be dependent on the treatment at first, but independent once we condition on the confounders \(X\)

\[ Y^d \perp D | X \]

If the traditional independence assumption states that the treatment assignment can’t be related to the levels of the potential outcomes, the parallel trends assumption states that it can’t be related to the growth of the potential outcomes over time. One way to write the parallel trends assumption is as follows

\[ \big(Y^d(t) - Y^d(t-1) \big) \perp D \]

Panel Data and Fixed Effects

This assumption says it is fine that we assign the treatment to units that have a higher or lower level of the outcome. What we can’t do is assign the treatment to units based on how the outcome is growing.

In our billboard example, this means it is OK to place billboards only in cities with originally high deposit levels. What we can’t do is place billboards only in cities where the deposits are growing the most.

Remember that DiD imputes the counterfactual growth in the treated unit with the growth in the control unit. If the growth the treated unit would have had without treatment differs from the growth in the control unit, then we are in trouble.

Panel Data and Fixed Effects

Controlling What you Cannot See

Methods like propensity score weighting, linear regression, and matching are very good at controlling for confounding in non-random/observational data, but they rely on a key assumption: conditional unconfoundedness

\[ (Y^0, Y^1) \perp D | X \]

To put it in words, they require that all the confounders are known and measured, so that we can condition on them and make the treatment as good as random.

Panel Data and Fixed Effects

Controlling What you Cannot See

One issue is that sometimes we can’t measure a confounder. For instance, take a classical labor economics problem of figuring out the impact of marriage on men’s earnings. It’s considered a fact in economics that married men earn more than single men. However, it is not clear if this relationship is causal or not.

It could be that more educated men are both more likely to marry and more likely to have a high earnings job, which would mean that education is a confounder of the effect of marriage on earnings. For this confounder, we could measure the education of the person in the study and run a regression controlling for that.

But another confounder could be physical attractiveness. It could be that more handsome men are both more likely to get married and more likely to have a high paying job. Unfortunately, attractiveness is one of those characteristics like intelligence. It’s something we can’t measure very well.

Panel Data and Fixed Effects

Fixed Effects

load panel data
data_panel <- readr::read_csv("data/wage_panel.csv", show_col_types = FALSE)

data_panel |> 
  dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::tab_options( table.font.size = gt::px(28) ) |>
  gtExtras::gt_theme_espn()
nr year black exper hisp hours married educ union lwage expersq occupation
13 1980 0 1 0 2672 0 14 0 1.198 1 9
13 1981 0 2 0 2320 0 14 1 1.853 4 9
13 1982 0 3 0 2940 0 14 0 1.344 9 9
13 1983 0 4 0 2960 0 14 0 1.433 16 9
13 1984 0 5 0 3071 0 14 0 1.568 25 5

Generally, the fixed effect model is defined as

\[ y_{it} = \beta X_{it} + \gamma U_i + e_{it} \]

where \(y_{it}\) is the outcome of individual \(i\) at time \(t\), \(X_{it}\) is the vector of variables for individual \(i\) at time \(t\). \(U_i\) is a set of unobservables for individual \(i\) .

Notice that those unobservables are unchanging through time, hence the lack of the time subscript.

Finally, \(e_{it}\) is the error term. For the marriage example, \(y_{it}\) is log wages, \(X_{it}\) are the observable variables that change over time, like marriage and experience, and \(U_i\) are the variables that are not observed but are constant for each individual, like beauty and intelligence.

Panel Data and Fixed Effects

Fixed Effects

Using panel data with a fixed effect model is as simple as adding a dummy for the entities.

This is true, but in practice, we don’t actually do it.
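Just to make the idea concrete, here is a sketch of the dummy-variable (LSDV) version on our wage panel, using factor(nr) to add one dummy per person; it should recover the same slopes as the demeaning approach used later:

Code
lm(lwage ~ married + expersq + union + hours + factor(nr), data = data_panel) |>
  broom::tidy() |>
  # drop the individual dummy coefficients from the output
  dplyr::filter(!grepl("^factor", term))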

Imagine a dataset where we have 1 million customers. If we add one dummy for each of them, we would end up with 1 million columns, which is probably not a good idea. Instead, we use the trick of partitioning the linear regression into two separate regressions.

We’ve seen this before, but now is a good time to recap it.

Panel Data and Fixed Effects

Fixed Effects

Suppose you have a linear regression model with a set of features \(X_1\) and another set of features \(X_2\).

\[ \hat{Y} = \hat{\beta_1} X_1 + \hat{\beta_2} X_2 \]

where \(X_1\) and \(X_2\) are feature matrices (one row per feature and one column per observation) and \(\hat{\beta_1}\) and \(\hat{\beta_2}\) are row vectors. You can get the exact same \(\hat{\beta_1}\) parameter by doing the following (a quick numerical check follows the list):

  1. regress the outcome \(y\) on the second set of features \(\hat{y^*} = \hat{\gamma_1} X_2\)
  2. regress the first set of features on the second \(\hat{X_1} = \hat{\gamma_2} X_2\)
  3. obtain the residuals \(\tilde{X}_1 = X_1 - \hat{X_1}\) and \(\tilde{y} = y - \hat{y^*}\)
  4. regress the residuals of the outcome on the residuals of the features \(\tilde{y} = \hat{\beta_1} \tilde{X}_1\)
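A quick numerical check of this result on simulated data (variable names here are purely illustrative):

Code
set.seed(1)
n  <- 500
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)
y  <- 2 * x1 - 1 * x2 + rnorm(n)

# coefficient on x1 from the full regression
coef(lm(y ~ x1 + x2))["x1"]

# FWL: residualize y and x1 on x2, then regress residual on residual
y_res  <- resid(lm(y ~ x2))
x1_res <- resid(lm(x1 ~ x2))
coef(lm(y_res ~ x1_res))["x1_res"]  # same coefficient, up to numerical error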

Panel Data and Fixed Effects

Fixed Effects

The parameter from this last regression will be exactly the same as running the regression with all the features.

But how exactly does this help us? Well, we can break the estimation of the model with the entity dummies into two parts. First, we use the dummies to predict the outcome and the features; these are steps 1 and 2 above.

Panel Data and Fixed Effects

Fixed Effects

Recall that running a regression on a dummy variable amounts to estimating the mean for each level of the dummy. Let’s run a model where we predict wages as a function of the year dummies.

lm(lwage ~ year, data = data_panel |> dplyr::mutate(year = factor(year))) |> 
  broom::tidy()
# A tibble: 8 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    1.39     0.0220     63.5  0       
2 year1981       0.119    0.0311      3.84 1.22e- 4
3 year1982       0.178    0.0311      5.74 1.02e- 8
4 year1983       0.226    0.0311      7.27 4.21e-13
5 year1984       0.297    0.0311      9.56 1.94e-21
6 year1985       0.346    0.0311     11.1  1.93e-28
7 year1986       0.406    0.0311     13.1  2.19e-38
8 year1987       0.473    0.0311     15.2  4.40e-51

Panel Data and Fixed Effects

Fixed Effects

Notice how this model predicts the average log wage in 1980 to be 1.3935, in 1981 to be 1.5129 (1.3935 + 0.1194), and so on. Now, if we compute the average by year, we get exactly the same result. (Remember that the base year, 1980, is the intercept, so you have to add the intercept to the parameters of the other years to get the mean lwage for each year.)

Code
data_panel |> dplyr::group_by(year) |> 
  dplyr::summarise(avg_lwage = mean(lwage)) |> 
  gt::gt('year') |> 
  gt::tab_options( table.font.size = gt::px(28) ) |>
  gtExtras::gt_theme_espn()
year avg_lwage
1980 1.393
1981 1.513
1982 1.572
1983 1.619
1984 1.690
1985 1.739
1986 1.800
1987 1.866

Panel Data and Fixed Effects

Fixed Effects

This means that if we compute the average for every person in our panel and subtract it, we are essentially regressing the other variables on the individual dummies and keeping the residuals. This motivates the following estimation procedure:

  1. Create time-demeaned variables by subtracting the mean for the individual:

\[ \begin{align*} \ddot{Y}_{it} & =Y_{it}-\bar{Y}_{i}\\ \ddot{X}_{it} & =X_{it}-\bar{X}_{i} \end{align*} \]

  2. Regress \(\ddot{Y}_{it}\) on \(\ddot{X}_{it}\)

Panel Data and Fixed Effects

Fixed Effects

Notice that when we do so, the unobserved \(U_i\) vanishes. Since \(U_i\) is constant across time, we have that \(\bar{U_i}=U_i\). If we have the following system of two equations

\[ \begin{split} \begin{align} Y_{it} & = \beta X_{it} + \gamma U_i + e_{it} \\ \bar{Y}_{i} & = \beta \bar{X}_{i} + \gamma \bar{U}_i + \bar{e}_{i} \\ \end{align} \end{split} \]

Panel Data and Fixed Effects

Fixed Effects

And we subtract one from the other, we get

\[ \begin{split} \begin{align} (Y_{it} - \bar{Y}_{i}) & = (\beta X_{it} - \beta \bar{X}_{i}) + (\gamma U_i - \gamma U_i) + (e_{it}-\bar{e}_{i}) \\ (Y_{it} - \bar{Y}_{i}) & = \beta(X_{it} - \bar{X}_{i}) + (e_{it}-\bar{e}_{i}) \\ \ddot{Y}_{it} & = \beta \ddot{X}_{it} + \ddot{e}_{it} \\ \end{align} \end{split} \]

which wipes out all unobserved variables that are constant across time. In fact, not only do the unobserved variables vanish: every variable that is constant over time is wiped out. For this reason, you can’t include any time-constant variables in the model, as they would be a linear combination of the entity dummies and the model wouldn’t run.

Panel Data and Fixed Effects

Fixed Effects

To check which variables are those, we can group our data by individual and get the sum of the standard deviations. If it is zero, it means the variable isn’t changing across time for any of the individuals.

Code
data_panel |> dplyr::group_by(nr) |> 
  dplyr::mutate(across(!starts_with("nr"),sd)) |> 
  dplyr::distinct() |> 
  dplyr::ungroup() |> 
  dplyr::summarise(across(!starts_with("nr"),sum)) |> 
  tidyr::pivot_longer(everything()) |> 
  gt::gt('nr') |> 
  gt::tab_options( table.font.size = gt::px(28) ) |>
  gtExtras::gt_theme_espn()
name value
year 1335.0
black 0.0
exper 1335.0
hisp 0.0
hours 203098.2
married 140.4
educ 0.0
union 106.5
lwage 173.9
expersq 17608.2
occupation 739.2

Panel Data and Fixed Effects

Fixed Effects

So, for our data, we need to remove the ethnicity dummies, black and hisp, since they are constant within individual, and we need to remove education for the same reason. We will also not use occupation, since it probably mediates the effect of marriage on wage (it could be that single men are able to take more time-demanding positions). Having selected the features we will use, it’s time to estimate this model.

To run our fixed effect model, first, let’s get our mean data. We can achieve this by grouping everything by individuals and taking the mean.

Code
data_panel |> dplyr::group_by(nr) |>
  dplyr::select(married, expersq, union, hours, lwage) |> 
  dplyr::summarize(across(!starts_with("nr"),mean), .groups='drop') |> 
  dplyr::slice_head(n=5) |> 
  gt::gt('nr') |> 
  gt::tab_options( table.font.size = gt::px(28) ) |>
  gtExtras::gt_theme_espn()
nr married expersq union hours lwage
13 0.000 25.5 0.125 2808 1.256
17 0.000 61.5 0.000 2504 1.638
18 1.000 61.5 0.000 2350 2.034
45 0.125 35.5 0.250 2226 1.774
110 0.500 77.5 0.125 2108 2.055

Panel Data and Fixed Effects

Fixed Effects

To demean the data, we group the original data by the individual identifier, nr, and subtract each individual’s mean from the corresponding columns.
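A minimal sketch of this step, assuming the object data_panel_baked used in the chunks below is the resulting time-demeaned data frame:

Code
data_panel_baked <- data_panel |>
  dplyr::group_by(nr) |>
  dplyr::mutate(dplyr::across(
    c(married, expersq, union, hours, lwage),
    ~ .x - mean(.x)
  )) |>
  dplyr::ungroup()

The first rows of the demeaned data look like this: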

nr married expersq union hours lwage
13 0 -24.5 -0.125 -135.6 -0.05811
13 0 -21.5 0.875 -487.6 0.59741
13 0 -16.5 -0.125 132.4 0.08881
13 0 -9.5 -0.125 152.4 0.17756
13 0 -0.5 -0.125 263.4 0.31247

Panel Data and Fixed Effects

Fixed Effects

Finally, we can run our fixed effect model on the time-demeaned data.

Code
lm(lwage ~ married + expersq + union + hours, data = data_panel_baked) |> 
  broom::tidy()
# A tibble: 5 × 5
  term         estimate std.error statistic   p.value
  <chr>           <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept) -4.71e-17 0.00507   -9.28e-15 1.00e+  0
2 married      1.15e- 1 0.0170     6.76e+ 0 1.61e- 11
3 expersq      3.95e- 3 0.000180   2.20e+ 1 1.92e-101
4 union        7.84e- 2 0.0184     4.26e+ 0 2.08e-  5
5 hours       -8.46e- 5 0.0000125 -6.74e+ 0 1.74e- 11

Panel Data and Fixed Effects

Fixed Effects

If we believe that the fixed effects eliminate all the omitted variable bias, this model is telling us that marriage increases a man’s wage by about 11%, a highly significant result. One detail is that for fixed effects models the standard errors should be clustered by entity. So, instead of doing all the estimation by hand (which is only nice for pedagogical reasons), we can use estimatr::lm_robust, which absorbs the entity fixed effects via its fixed_effects argument and can cluster the standard errors via its clusters argument.

Code
# see https://www.r-bloggers.com/2021/05/clustered-standard-errors-with-r/
estimatr::lm_robust(
  lwage ~ expersq+union+married+hours 
  , data = data_panel_baked
  , fixed_effects = ~nr ) |> broom::tidy()
     term   estimate std.error statistic   p.value   conf.low  conf.high   df outcome
1 expersq  0.0039509 1.861e-04    21.232 1.208e-94  0.0035861  4.316e-03 3811   lwage
2   union  0.0784442 1.991e-02     3.940 8.287e-05  0.0394117  1.175e-01 3811   lwage
3 married  0.1146543 1.825e-02     6.281 3.739e-10  0.0788661  1.504e-01 3811   lwage
4   hours -0.0000846 1.842e-05    -4.593 4.512e-06 -0.0001207 -4.849e-05 3811   lwage
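The call above uses the default robust (not clustered) standard errors; a sketch of how to cluster by individual using the clusters argument (output not shown):

Code
estimatr::lm_robust(
  lwage ~ expersq+union+married+hours 
  , data = data_panel_baked
  , fixed_effects = ~nr
  , clusters = nr ) |> broom::tidy()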

Panel Data and Fixed Effects

Fixed Effects

Notice how the parameter estimates are identical to the ones we got with the time-demeaned data. The only difference is that the standard errors are a bit larger. Compare this to a simple OLS model that doesn’t take the time structure of the data into account; for that model, we add back the variables that are constant in time.
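The OLS output below is likely produced by a call along these lines (a sketch):

Code
lm(lwage ~ expersq + union + married + hours + black + hisp + educ, data = data_panel) |>
  broom::tidy()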

# A tibble: 8 × 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  0.265     0.0647        4.10  4.16e-  5
2 expersq      0.00324   0.000206     15.7   2.14e- 54
3 union        0.183     0.0173       10.6   6.27e- 26
4 married      0.141     0.0158        8.93  6.11e- 19
5 hours       -0.0000532 0.0000134    -3.98  7.05e-  5
6 black       -0.135     0.0237       -5.68  1.44e-  8
7 hisp         0.0132    0.0210        0.632 5.28e-  1
8 educ         0.106     0.00469      22.6   1.38e-106

This model says that marriage increases a man’s wage by about 14%, a somewhat larger effect than the one we found with the fixed effects model. This suggests some omitted variable bias from fixed individual factors, like intelligence and beauty, that are not included in the model.

More

Recap

We looked in some detail at other methods used to estimate causal effects, along with methods used for special situations, including:

  • regression adjustment
  • doubly robust estimation
  • matching
  • difference in differences
  • fixed effects and methods for panel data