Week 3 - Regression Methods

Important

Due date: Lab 3 - Sunday, Sept 29, 5pm ET

Prepare

📖 Read Chapter 2 - General Aspects of Fitting Regression Models in: Regression Modeling Strategies

📖 Read Chapter 8 (Sections 8.1-8.5) - Regression Models in: Modern Statistics in R

📖 Follow along with the R code in Linear Regression in R: Linear Regression Hands on Tutorial

📖 Follow along with the R code from R code for Regression Analysis in An R companion

📖 Check out Regression and Other Stories - Examples: Regression and Other Stories Examples

Participate

🖥️ Lecture 3 - Regression Methods

Perform

⌨️ Lab 3 - Regression Methods

⌨️ Example 1: grouped data & weighted regression

From Section 10.8 of Regression and Other Stories:

Three models leading to weighted regression

Weighted least squares can be derived from three different models:

1. Using observed data to represent a larger population. This is the most common way that regression weights are used in practice. A weighted regression is fit to sample data in order to estimate the (unweighted) linear model that would be obtained if it could be fit to the entire population. For example, suppose our data come from a survey that oversamples older white women, and we are interested in estimating the population regression. Then we would assign to each survey respondent a weight that is proportional to the number of people of that type in the population represented by that person in the sample. In this example, men, younger people, and members of ethnic minorities would have higher weights. Including these weights in the regression is a way to approximately minimize the sum of squared errors with respect to the population rather than the sample.
2. Duplicate observations. More directly, suppose each data point can represent one or more actual observations, so that i represents a collection of w_i data points, all of which happen to have x_i as their vector of predictors, and where y_i is the average of the corresponding w_i outcome variables. Then weighted regression on the compressed dataset, (x, y, w), is equivalent to unweighted regression on the original data.
3. Unequal variances. From a completely different direction, weighted least squares is the maximum likelihood estimate for the regression model with independent normally distributed errors with unequal variances, where sd(ε_i) is proportional to 1/√w_i. That is, measurements with higher variance get lower weight when fitting the model. As discussed further in Section 11.1, unequal variances are not typically a major issue for the goal of estimating regression coefficients, but they become more important when making predictions about individual cases.
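
The second and third derivations can be written out in a few lines. The following is a sketch in the notation of the quoted passage; the double index $y_{ij}$ (the $j$-th original observation in group $i$) and the scale parameter $\sigma$ are introduced here for illustration and do not appear in the source.

```latex
\begin{align*}
\text{Weighted least squares:}\quad
  & \hat\beta = \arg\min_\beta \sum_i w_i\,(y_i - x_i^\top \beta)^2 \\[4pt]
\text{Duplicate observations:}\quad
  & \sum_i \sum_{j=1}^{w_i} (y_{ij} - x_i^\top \beta)^2
    = \sum_i w_i\,(y_i - x_i^\top \beta)^2
    + \sum_i \sum_{j=1}^{w_i} (y_{ij} - y_i)^2,
    \qquad y_i = \frac{1}{w_i} \sum_{j=1}^{w_i} y_{ij} \\[4pt]
\text{Unequal variances:}\quad
  & \varepsilon_i \sim \mathrm{N}(0,\ \sigma^2 / w_i)
    \;\Longrightarrow\;
    \log L(\beta) = \text{const} - \frac{1}{2\sigma^2} \sum_i w_i\,(y_i - x_i^\top \beta)^2
\end{align*}
```

In the duplicate-observations line, the last term does not depend on $\beta$, so both sides have the same minimizer. In the unequal-variances line, "const" collects the terms that do not involve $\beta$, so maximizing the log-likelihood over $\beta$ is the same as minimizing the weighted sum of squares.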

We will use weighted regression later in the course (Lectures 7 & 8), using observed data to represent a larger population - case 1 above.
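
The first case can be illustrated with a small simulation. This is a sketch only: the simulated population, the group indicator `g`, the selection probabilities (0.10 and 0.02), and all object names are assumptions made up for illustration, not taken from the course material or the book.

```r
set.seed(1)

# hypothetical population: the slope of y on x differs between two groups
N <- 1e5
pop <- data.frame(g = rbinom(N, 1, 0.5), x = rnorm(N))
pop$y <- with(pop, 1 + x + 2 * g * x + rnorm(N))

# the survey oversamples group g = 1 (assumed selection probabilities)
p_sel <- ifelse(pop$g == 1, 0.10, 0.02)
samp  <- pop[runif(N) < p_sel, ]

# each respondent's weight is proportional to the number of
# population members that respondent represents (inverse selection probability)
samp$w <- 1 / ifelse(samp$g == 1, 0.10, 0.02)

fit_pop <- lm(y ~ x, data = pop)                # target: population regression
fit_unw <- lm(y ~ x, data = samp)               # unweighted sample regression
fit_w   <- lm(y ~ x, data = samp, weights = w)  # weighted sample regression

rbind(population = coef(fit_pop),
      unweighted = coef(fit_unw),
      weighted   = coef(fit_w))
```

Under these assumptions, the weighted slope should land much closer to the population slope than the unweighted one, because the weights undo the oversampling of the group with the steeper slope.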

Here’s an example of the second case:

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(dplyr, broom)


set.seed(1024)

# individual (true) dataset, with 100,000 rows
x <- round(rnorm(1e5))
y <- round(x + x^2 + rnorm(1e5))
ind <- data.frame(x, y)

# aggregated dataset: one row per distinct (x, y) pair, with its frequency
agg <- ind |>
  dplyr::group_by(x, y) |>
  dplyr::summarize(freq = dplyr::n(), .groups = 'drop')

models <- list(
  "True"           = lm(y ~ x, data = ind),
  "Aggregated"     = lm(y ~ x, data = agg),
  "Aggregated & W" = lm(y ~ x, data = agg, weights = freq)
)

models[['True']] |> broom::tidy(conf.int = TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1 (Intercept)     1.08   0.00580      187.       0    1.07       1.10
2 x               1.01   0.00558      181.       0    0.998      1.02

models[['Aggregated']] |> broom::tidy(conf.int = TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    5.51      0.717      7.69 8.74e-11    4.08       6.95
2 x              0.910     0.302      3.01 3.69e- 3    0.306      1.51

models[['Aggregated & W']] |> broom::tidy(conf.int = TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic    p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
1 (Intercept)     1.08     0.224      4.84 0.00000795    0.637      1.53
2 x               1.01     0.216      4.68 0.0000145     0.579      1.44

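Note the differences in the coefficient estimates for $x$ and the corresponding standard errors. The weighted regression on the aggregated data reproduces the individual-level estimates (1.08 and 1.01), whereas the unweighted aggregated regression does not (5.51 and 0.910). The standard errors still differ: lm() treats each aggregated row as a single observation when computing the residual degrees of freedom, so the weighted fit reports much larger standard errors (0.216 vs. 0.00558 for $x$) than the fit to all 100,000 rows.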
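
The relationship between the two sets of standard errors can be checked directly. The following is a minimal sketch, assuming base R only and a smaller simulated dataset than the one above; the object names (`fit_ind`, `fit_agg`, `se_rescaled`) are illustrative. It rescales the weighted fit's standard errors by the ratio of residual degrees of freedom and compares them with the individual-level ones.

```r
set.seed(1024)

# individual-level data (smaller than in the example above, for speed)
n <- 1e4
x <- round(rnorm(n))
y <- round(x + x^2 + rnorm(n))
ind <- data.frame(x, y)

# collapse to one row per distinct (x, y) pair, with its frequency
agg <- aggregate(freq ~ x + y, data = transform(ind, freq = 1), FUN = sum)

fit_ind <- lm(y ~ x, data = ind)
fit_agg <- lm(y ~ x, data = agg, weights = freq)

# the point estimates agree
coef_match <- all.equal(coef(fit_ind), coef(fit_agg))

# lm() estimates the residual variance with df = nrow(agg) - 2,
# so rescale the standard errors to the individual-level df
se_ind      <- sqrt(diag(vcov(fit_ind)))
se_agg      <- sqrt(diag(vcov(fit_agg)))
se_rescaled <- se_agg * sqrt(df.residual(fit_agg) / df.residual(fit_ind))
se_match    <- all.equal(se_ind, se_rescaled)

cbind(se_ind, se_agg, se_rescaled)
```

The rescaling works here because every aggregated row stands for exact duplicates of the same (x, y) pair, so the weighted residual sum of squares equals the individual-level one and only the degrees of freedom differ.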
