# check if 'librarian' is installed and if not, install it
if (!"librarian" %in% rownames(installed.packages())) {
  install.packages("librarian")
}

# load packages if not already loaded
librarian::shelf(
  tidyverse, broom, rsample, ggdag, causaldata, halfmoon, ggokabeito,
  malcolmbarrett/causalworkshop, magrittr, ggplot2, estimatr, Formula,
  r-causal/propensity, gt, gtExtras, timetk, modeltime
)

# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))
Part 1
Q-1
In the context of time series, partial autocorrelation measures are:
SOLUTION (1 point) :
The correlation between two variables, removing the effect of intervening variables
Q-2
In a causal DAG, a confounder is:
SOLUTION (1 point) :
A variable that influences both the cause and effect
Q-3
Stationarity in time series analysis means that:
SOLUTION (1 point) :
The series has a constant mean and variance over time
Q-4
For the binary classifier with the confusion matrix below:
The precision of this binary classifier is approximately:
SOLUTION (1 point) :
0.75
Precision is calculated as:

Precision = TP / (TP + FP)

where:
True Positives (TP): The number of correctly predicted positive instances.
False Positives (FP): The number of instances incorrectly predicted as positive.
Alternatively, precision is the number of true positives out of all predicted positives. Here the number of true positives is 45 and the number of predicted positives is 60, so the precision is 45/60 = 0.75.
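Using the counts from the explanation above (TP = 45, 60 predicted positives, so FP = 15), the calculation can be checked in R:

```r
# confusion-matrix counts taken from the explanation above
TP <- 45  # true positives
FP <- 15  # false positives (60 predicted positives - 45 true positives)

precision <- TP / (TP + FP)
precision
#> 0.75
```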
Q-5
Which distance metric is not commonly used in kNN classifiers?
SOLUTION (1 point) :
Hamming distance
The Hamming distance measures the number of positions at which the corresponding elements of two vectors differ. It is primarily used for comparing categorical, binary, or string data, where it calculates the count of mismatches between two vectors.
Hamming distance is specialized for exact matches in binary or categorical data, which is rare in most kNN applications that rely on numerical distances in continuous space.
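As a quick illustration, the Hamming distance between two vectors can be computed in base R as the count of mismatched positions:

```r
# Hamming distance: number of positions where two vectors differ
hamming <- function(a, b) sum(a != b)

hamming(c(1, 0, 1, 1), c(1, 1, 0, 1))
#> 2
```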
Q-6
In causal DAGs, what does a directed edge represent?
SOLUTION (1 point) :
Causation
Q-7
How does the kNN algorithm typically perform on very large datasets?
SOLUTION (1 point) :
It becomes slower due to the increased computation of distances
Q-8
How many open paths are in the DAG above?
SOLUTION (1 point) :
4
A path is open if it has no non-collider nodes that have been conditioned on.
If a collider exists along the path, conditioning on the collider (or its descendants) can open the path.
# define the DAG coordinates
coord_dag <- list(
  x = c(z2 = 0, z1 = 0, z3 = -1, x = 1, y = 2),
  y = c(z2 = -1, z1 = 1, z3 = 0, x = 0, y = 0)
)

# specify the DAG
dag <- ggdag::dagify(
  y ~ x + z1 + z2 + z3,
  x ~ z3 + z1 + z2,
  coords = coord_dag,
  exposure = "x",
  outcome = "y"
)

# find the paths and indicate which are open
dag |> dagitty::paths()
Q-9
What is the purpose of introducing a soft margin in an SVM?
SOLUTION (1 point) :
To allow for a certain degree of misclassification in the training data
In Support Vector Machines (SVM), a soft margin is introduced to allow for some misclassification of data points in order to handle cases where the data is not perfectly linearly separable. The soft margin concept relaxes the strict requirement of a hard margin (where all points must lie on the correct side of the margin), allowing the SVM to create a more flexible and generalizable decision boundary.
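As a sketch of this idea (assuming the e1071 package, which is not loaded in the setup chunk above), the cost argument of e1071::svm() controls how soft the margin is: a small cost tolerates more training misclassification, while a large cost approaches a hard margin:

```r
library(e1071)

# simulate two overlapping classes (not perfectly linearly separable)
set.seed(8740)
train <- data.frame(
  x1 = c(rnorm(50, 0), rnorm(50, 2)),
  x2 = c(rnorm(50, 0), rnorm(50, 2)),
  y  = factor(rep(c("a", "b"), each = 50))
)

# small cost: soft margin, wider margin, more misclassification allowed
svm_soft <- svm(y ~ ., data = train, kernel = "linear", cost = 0.1)

# large cost: approaches a hard margin, fits the training data more tightly
svm_hard <- svm(y ~ ., data = train, kernel = "linear", cost = 100)
```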
Q-10
Which of the following is not a typical component of time series data?
SOLUTION (1 point) :
Trend, Seasonality, and Cyclical are all typical components of time series data, so the remaining option in the question is the one that is not a typical component.
Part 2
Q-11
This question uses data for the closing prices of the five major Canadian banks from 2005-08-10 to 2023-09-29. The data was obtained using the following code (the difference in the time range is due to the elimination of rows with NA values):
tidyquant::tq_get(
  c("TD", "BMO", "BNS", "RBC", "CM"),
  get  = "stock.prices",
  from = "2000-01-01",
  to   = "2023-10-01"
)
(1) Plot the data using functions in the timetk package (0.5 point)
SOLUTION :
# A PLOT OF THE CLOSING PRICES FOR THE FIVE MAJOR CANADIAN BANKS
arima_data %>%
  tidyr::pivot_longer(-date) %>%
  timetk::plot_time_series(
    .date_var  = date,
    .value     = value,
    .color_var = name,
    .smooth    = FALSE
  )
The goal is to build and evaluate an ARIMA model to predict the stock price of CIBC (symbol ‘CM’), using the workflow we developed in class.
(2) Create test/train splits of the data, where the initial period is 10 years and the assessment period is 1 year. Plot the test/train series for CIBC (symbol ‘CM’). (0.5 point)
SOLUTION :
# DEFINITION AND A PLOT (CM) OF TEST & TRAINING SPLITS OF THE DATA
splits <- timetk::time_series_split(
  data    = arima_data,
  initial = "10 year",
  assess  = "1 year"
)

splits %>%
  timetk::tk_time_series_cv_plan() %>%
  timetk::plot_time_series_cv_plan(.date_var = date, .value = CM)
(3) Define a data preprocessing recipe and a model definition. The recipe is based on the formula CM ~ ., and its data argument should use the training data. The model engine should be auto_arima.
Finally, create a workflow object containing the recipe and the model spec, and then fit the model using the training data. (1 point)
SOLUTION :
# A RECIPE
time_rec <- arima_data %>%
  recipes::recipe(CM ~ ., data = rsample::training(splits))

# A MODEL SPECIFICATION
model_spec_arima <- modeltime::arima_reg() %>%
  parsnip::set_engine("auto_arima")

# A FITTED WORKFLOW
workflow_fit_arima <- workflows::workflow() %>%
  workflows::add_recipe(time_rec) %>%
  workflows::add_model(model_spec_arima) %>%
  parsnip::fit(rsample::training(splits))
frequency = 5 observations per 1 week
(4) Create a models table with your fitted model and a calibration table that uses the testing data. Generate a forecast with the testing data and the original arima_data. Plot the forecast. (1 point)
SOLUTION :
# A MODELS TABLE
models_tbl <- modeltime::modeltime_table(workflow_fit_arima)

# A CALIBRATION TABLE
calibration_tbl <- models_tbl %>%
  modeltime::modeltime_calibrate(new_data = rsample::testing(splits))

# PLOT OF THE FITTED MODEL FORECAST OF THE TESTING DATA
calibration_tbl %>%
  modeltime::modeltime_forecast(
    new_data    = rsample::testing(splits),
    actual_data = arima_data
  ) %>%
  modeltime::plot_modeltime_forecast(
    .legend_max_width = 25, # for mobile screens
    .interactive      = TRUE
  )
(5) Compute the accuracy metrics for the forecast. What is the rsq metric? (1 point)
The rsq metric for the fit of the ARIMA model to the testing data is 0.65, as can be read off the accuracy table.
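The accuracy table referred to above can be produced from the calibration table defined in part (4); a minimal sketch, assuming the calibration_tbl object from the previous part:

```r
# accuracy metrics (including rsq) for the fit to the testing data
calibration_tbl %>%
  modeltime::modeltime_accuracy() %>%
  modeltime::table_modeltime_accuracy(.interactive = FALSE)
```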
Q-12
Execute the following code to create simulated observational data, where D is the treatment variable and Y is the response variable.
set.seed(8740)
n <- 800
V <- rbinom(n, 1, 0.2)
W <- 3 * V + rnorm(n)
D <- V + rnorm(n)
Y <- D + W^2 + 1 + rnorm(n)
Z <- D + Y + rnorm(n)
data_obs <- tibble::tibble(V = V, W = W, D = D, Y = Y, Z = Z)
In the code below we fit several different outcome models. Compare the resulting coefficients for D. Which regressions appear to lead to unbiased estimates of the causal effect? (1.5 points)
# linear model of Y on D
lin_YX <- lm(Y ~ D, data = data_obs)
# linear model of Y on D and V
lin_YV <- lm(Y ~ D + V, data = data_obs)
# linear model of Y on D and W
lin_YW <- lm(Y ~ D + W, data = data_obs)
List all valid adjustment sets for the causal structure in this data (a good first step is to sketch the causal relations between variables - you don’t need ggdag::dagify). (1.5 points)
SOLUTION :
# extract the coefficients for each regression
tibble::tibble(
  model = c("lin_YX", "lin_YV", "lin_YW"),
  m     = list(lin_YX, lin_YV, lin_YW)
) %>%
  dplyr::mutate(
    D_coefficient = purrr::map_dbl(
      m,
      function(s) {
        s %>%
          broom::tidy() %>%
          dplyr::filter(term == "D") %>%
          dplyr::pull(estimate)
      }
    )
  )
# A tibble: 3 × 3
  model  m      D_coefficient
  <chr>  <list>         <dbl>
1 lin_YX <lm>           2.27
2 lin_YV <lm>           0.921
3 lin_YW <lm>           1.30
Regressions that appear to lead to unbiased estimates of the causal effect: lin_YV and lin_YW (both are less biased than lin_YX).
From the code used to generate the data, the true coefficient of D is 1. lin_YV (controlling for V) gives a more accurate estimate, while lin_YW also finds an estimate close to the correct value.
Valid adjustment sets for the data used in this question:
# specify the DAG for this question, based on the data generation mechanism
Q10dag <- ggdag::dagify(
  W ~ V,
  D ~ V,
  Y ~ D + W,
  Z ~ D + Y,
  exposure = "D",
  outcome  = "Y"
)

# list the valid adjustment sets
dagitty::adjustmentSets(Q10dag)
Here conditioning on either V or W blocks the single open backdoor path D <- V -> W -> Y, so the valid adjustment sets are {V}, {W}, and {V, W}. Note that the DAG doesn't capture the quadratic relationship of W to Y, just that Y depends on W. The quadratic relationship is why lin_YW doesn't produce a great estimate.
In practice you would note the difference between lin_YV and lin_YW and experiment with different regression specifications, e.g.:
lm(Y ~ D + I(W^2), data = data_obs)

Call:
lm(formula = Y ~ D + I(W^2), data = data_obs)

Coefficients:
(Intercept)            D       I(W^2)
     0.9833       0.9800       1.0107
Q-13
For this question we’ll use the Spam Classification Dataset available from the UCI Machine Learning Repository. It features a collection of spam and non-spam emails represented as feature vectors, making it suitable for a logistic regression model. The data is in your data/ directory and the metadata is in the data/spambase/ directory.
We’ll use this data to create a model for detecting email spam using logistic regression.
(1) Split the data into test and training sets, and create a default recipe and a default model specification. Use the glmnet engine for the model, with penalty = 0.05 & mixture = 0.5. (1 point)
(2) Create a default workflow object with the recipe and the model specification, fit the workflow using parsnip::fit and the training data, and then generate the testing results by applying the fit to the testing data using broom::augment. (1 point)
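A minimal sketch of parts (1)-(2), following the tidymodels workflow used elsewhere in this document. The object name spam_data and its outcome column type are assumptions, since the data-loading code is not shown here:

```r
# (1) test/train split, default recipe, glmnet model specification
# NOTE: spam_data and its outcome column `type` are assumed names
set.seed(8740)
splits   <- rsample::initial_split(spam_data, strata = type)
spam_rec <- recipes::recipe(type ~ ., data = rsample::training(splits))

model_spec <- parsnip::logistic_reg(penalty = 0.05, mixture = 0.5) %>%
  parsnip::set_engine("glmnet")

# (2) workflow, fit on the training data, augment the testing data
spam_wf <- workflows::workflow() %>%
  workflows::add_recipe(spam_rec) %>%
  workflows::add_model(model_spec)

spam_fit <- spam_wf %>% parsnip::fit(rsample::training(splits))

test_results <- broom::augment(spam_fit, rsample::testing(splits))
```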
(4) Is there a way you could improve the accuracy of this model? (1 point)
SOLUTION :
This model could be made more accurate by tuning the hyperparameters of the logistic regression model (e.g., the penalty and mixture values) rather than fixing them in advance.
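For example, penalty and mixture could be tuned with cross-validation. This is a sketch assuming the tune package and the recipe and splits created in parts (1) and (2) of this question (the names spam_rec and splits are assumptions):

```r
# mark penalty and mixture for tuning instead of fixing them
tune_spec <- parsnip::logistic_reg(penalty = tune::tune(), mixture = tune::tune()) %>%
  parsnip::set_engine("glmnet")

# cross-validated grid search over the two hyperparameters
folds <- rsample::vfold_cv(rsample::training(splits), v = 5)

tuned <- workflows::workflow() %>%
  workflows::add_recipe(spam_rec) %>%
  workflows::add_model(tune_spec) %>%
  tune::tune_grid(resamples = folds, grid = 20)

# inspect the best-performing hyperparameter combinations
tune::show_best(tuned, metric = "accuracy")
```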
Q-14
When preprocessing data for time series models, what is the function timetk::step_fourier() used for? (1 point)
Give an example of its use in a recipe that is engineered for use with weekly data records. (1 point)
SOLUTION:
The timetk::step_fourier() function is used to add Fourier (sine and cosine) series features computed from a date variable, which model seasonality in time series data.
An example of its use in a recipe that is engineered for use with weekly data records is:
# sample weekly data
set.seed(123)
weekly_data <- tibble(
  date  = seq(as.Date("2023-01-01"), by = "week", length.out = 52),
  sales = 100 + 10 * sin(2 * pi * seq(1, 52) / 52) + rnorm(52, sd = 5)
)

# create a recipe
rec <- recipes::recipe(sales ~ date, data = weekly_data) %>%
  # add Fourier terms for seasonality with a weekly frequency;
  # use 1 or 2 harmonics depending on desired complexity
  timetk::step_fourier(date, period = 12, K = 2) # period = 12 weeks (quarterly seasonality for weekly data)

# prepare and bake the recipe to see the engineered features
rec_prep <- recipes::prep(rec)
weekly_data_with_fourier <- recipes::bake(rec_prep, new_data = NULL)
weekly_data_with_fourier
Q-15
In a paper published in the prestigious Proceedings of the National Academy of Sciences (PNAS) earlier this year:
Note
S. A. Rains, A. S. Richards, US state vaccine mandates did not influence COVID-19 vaccination rates but reduced uptake of COVID-19 boosters and flu vaccines compared to bans on vaccine restrictions. Proc. Natl. Acad. Sci. U.S.A. 121(8), e2313610121 (2024).
Rains & Richards performed a causal analysis and found that, compared to states that banned COVID-19 vaccination requirements, states that imposed COVID-19 vaccination mandates exhibited lower adult and child uptake of flu vaccines and lower uptake of COVID-19 boosters. They included their data and their code (in R), as is best practice.
In their analysis, the treatment was binary (vaccine mandate (1) or ban (0)). The proportion of people in a state that had been vaccinated was included to account for the general inclination toward COVID-19 vaccination in a state (mean centered). The outcome variable reflected the proportion of eligible people in a state who had received a booster or flu shot.
However, in a letter to PNAS published on September 30, 2024, Jack Fitzgerald argued that Rains & Richards had included a bad control in their analysis, a variable that biased their results.
Note
Fitzgerald, J.US states that mandated COVID-19 vaccination see higher, not lower, take-up of COVID-19 boosters and flu vaccines. Proc. Natl. Acad. Sci. U.S.A. 121(41), e2403758121 (2024).
Here is Fitzgerald’s DAG from his letter:
SOLUTION (2 points):
Fitzgerald argued that the vaccination rate variable was the bad control, since it is a collider.
Controlling for a collider opens the backdoor path from the treatment to the outcome through unobserved factors (for example, vaccine hesitancy), and this open path would bias the results. He re-ran the analyses without controlling for vaccination rate and found that the effect changed sign: from reducing booster uptake to increasing booster uptake.
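Fitzgerald's argument can be sketched as a DAG in the style used earlier in this document. This is a sketch based on the description above, not his exact diagram; U stands for unobserved factors such as vaccine hesitancy:

```r
# vaccination_rate is a collider on the path: mandate -> vaccination_rate <- U
fitz_dag <- ggdag::dagify(
  vaccination_rate ~ mandate + U,
  outcome          ~ mandate + U,
  exposure = "mandate",
  outcome  = "outcome",
  latent   = "U"
)

# conditioning on the collider opens the biasing path
# mandate -> vaccination_rate <- U -> outcome
ggdag::ggdag_adjust(fitz_dag, var = "vaccination_rate")
```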