# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
install.packages("librarian")
}
# load packages if not already loaded
librarian::shelf(
  tidyverse, broom, rsample, ggdag, causaldata, magrittr, gt, gtExtras, halfmoon, ggokabeito,
  ggplot2, survey, cardx, gtsummary, brms, shinystan, patchwork, tidybayes, ggthemes,
  michael-franke/aida-package, michael-franke/faintr, CogSciPrag/cspplot
)
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))
Lab 10 - Bayesian Methods
BSMM 8740 Fall 2024
Introduction
In today’s lab, you’ll practice creating Bayesian models.
Getting started
To complete the lab, log on to your GitHub account and then go to the class GitHub organization and find the 2024-lab-10-[your github username] repository.
Create an R project using your 2024-lab-10-[your github username] repository (remember to create a PAT, etc.) and add your answers by editing the 2024-lab-10.qmd file in your repository. When you are done, be sure to: save your document, stage, commit and push your work.
To access Github from the lab, you will need to make sure you are logged in as follows:
- username: .\daladmin
- password: Business507!
Remember to create a PAT and set your git credentials:

- create your PAT using usethis::create_github_token(),
- store your PAT with gitcreds::gitcreds_set(),
- set your username and email with usethis::use_git_config( user.name = ___, user.email = ___)
Packages
Exercise 1:
Data:
This data contains roughly 2000 trials of a mouse-tracking experiment. It is a preprocessed data set from an experiment conducted by Kieslich et al. (2020) in which participants classified specific animals into broader categories. The data set contains response times, MAD, AUC and other attributes, as well as all experimental conditions.
dolphin <- aida::data_MT

# aggregate
dolphin_agg <- dolphin |>
  dplyr::filter(correct == 1) |> # only correct values
  dplyr::group_by(subject_id) |>
  dplyr::summarize( # use the median AUC/MAD
    AUC = median(AUC, na.rm = TRUE),
    MAD = median(MAD, na.rm = TRUE)) |>
  dplyr::ungroup() |>
  dplyr::mutate(
    # the function scale centers and scales the data
    AUC = scale(AUC),
    MAD = scale(MAD)
  )
# let's take a look
head(dolphin_agg)
# A tibble: 6 × 3
subject_id AUC[,1] MAD[,1]
<dbl> <dbl> <dbl>
1 1001 0.643 0.593
2 1002 0.732 0.367
3 1003 -0.839 -0.833
4 1004 -0.551 -0.535
5 1005 0.619 0.436
6 1006 0.748 1.02
Plot AUC vs MAD. What does the relationship show? Does the relationship make sense?
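A minimal plotting sketch, assuming the dolphin_agg tibble created above:

```r
# scatterplot of the standardized per-subject medians
dolphin_agg |>
  ggplot(aes(x = MAD, y = AUC)) +
  geom_point()
```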
Regress AUC against MAD using a Bayesian regression with the dolphin_agg data. Save the results in the fits directory as “model1.” Use all the brms defaults.
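One way this might look with the brms defaults (the file argument caches the fit under the fits directory):

```r
# simple regression with default priors, family, and sampler settings
model1 <- brms::brm(
  AUC ~ MAD,
  data = dolphin_agg,
  file = "fits/model1" # saved as fits/model1.rds
)
```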
Extract the model coefficients, then redo the graph to add a geom_abline
with the model slope and intercept.
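A sketch of one approach, using brms::fixef() to pull the posterior means of the coefficients:

```r
coefs <- brms::fixef(model1) # matrix with rows "Intercept" and "MAD"

dolphin_agg |>
  ggplot(aes(x = MAD, y = AUC)) +
  geom_point() +
  geom_abline(
    intercept = coefs["Intercept", "Estimate"],
    slope     = coefs["MAD", "Estimate"]
  )
```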
Use tidybayes::get_variables to get the model variables, and then use tidybayes::spread_draws with model1 to get the draws for “b_MAD” and “b_Intercept” as a tibble, bound to the variable posteriors1.
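For example:

```r
# list the variable names available in the model
tidybayes::get_variables(model1)

# wide-format draws of the intercept and slope
posteriors1 <- model1 |>
  tidybayes::spread_draws(b_MAD, b_Intercept)
```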
Repeat the last operation, but this time use ndraws = 100 to limit the number of draws (call this posterior2). Again using the first graph, add a line with intercept and slope corresponding to each row of posterior2 (i.e. add 100 lines).
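A sketch, reusing the first scatterplot and overlaying one line per draw:

```r
posterior2 <- model1 |>
  tidybayes::spread_draws(b_MAD, b_Intercept, ndraws = 100)

dolphin_agg |>
  ggplot(aes(x = MAD, y = AUC)) +
  geom_point() +
  geom_abline(
    data = posterior2,
    aes(intercept = b_Intercept, slope = b_MAD),
    alpha = 0.2 # faint lines show posterior uncertainty
  )
```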
Use model1 and tidybayes::gather_draws, to get the draws for “b_MAD” as a tibble in long form, bound to the variable posteriors3. Rename ‘.variable’ to ‘parameter’ and ‘.value’ to ‘posterior’ and then keep only those two columns.
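This might look like:

```r
posteriors3 <- model1 |>
  tidybayes::gather_draws(b_MAD) |> # long format: .variable / .value
  dplyr::ungroup() |>
  dplyr::rename(parameter = .variable, posterior = .value) |>
  dplyr::select(parameter, posterior)
```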
Calculate the mean, the lower and the upper bound of a 90% CrI, using the function tidybayes::hdi().
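A sketch of the summary (tidybayes::hdi() returns the lower and upper interval bounds for a vector of draws):

```r
posteriors3 |>
  dplyr::group_by(parameter) |>
  dplyr::summarise(
    mean_posterior = mean(posterior),
    lower_90 = tidybayes::hdi(posterior, .width = 0.9)[1],
    upper_90 = tidybayes::hdi(posterior, .width = 0.9)[2]
  )
```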
Exercise 2: elasticity
In this exercise we will estimate price elasticity. The dataset contains price and monthly unit-volume data for four sites over 12 months. Since the unit-volumes are counts of units sold, we’ll need a discrete pmf to model the data generation process.
# read the data
dat <-
  readr::read_csv("data/price_data.csv", show_col_types = FALSE) |>
  dplyr::filter(site == "Windsor")

# plot the data
dat |>
  ggplot(aes(x = price, y = volume, group = site)) +
  geom_point()
Since elasticity is defined as the percentage change in volume ($dV/V$) for a given percentage change in price ($dP/P$), then with elasticity parameter $\beta$ we write:

$$\frac{dV}{V} = \beta\,\frac{dP}{P}$$

This equation is the justification for the log-log regression model of elasticity, and this model has solution $V = A\,P^{\beta}$, where $A$ is a constant.

As written, the value of $A$ is either the volume when $P = 1$, which may or may not be useful, or it is the volume when $\log(P) = 0$, which is uninteresting.

To make the interpretation of the constant more useful, the model can be written as

$$\log(V) = \log(A) + \beta\,\log(P/P_0)$$

in which case the constant $A$ is interpreted as the volume when the price equals the baseline price $P_0$; the elasticity parameter is unchanged.
The $\log$ on the left-hand side implies that our regression model should be given the $\log$ link in the Poisson model. To analyze the data (a sketch follows this list):

- scale and transform the price data by dividing by the mean price and then taking the log; assign the modified data to the variable dat01
- build and fit a Bayesian Poisson model with
  - a normal(0,5) prior on the intercept, and
  - a cauchy(0,10) prior on the independent variable with upper bound 0 (to ensure that elasticities are negative)
- assign the fit to the variable windsor_01 and save the fit in the folder “fits/windsor_01”
- once the model is fit, summarize the draws and show the pairs plot
- comment on the posterior distributions in the pairs plot
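A sketch of the transformation and the Poisson fit; the column names price and volume follow the plotting code above, while log_price is an assumed name for the transformed predictor:

```r
# center price on its mean, then take logs
dat01 <- dat |>
  dplyr::mutate(log_price = log(price / mean(price)))

# Poisson regression with the priors described above
windsor_01 <- brms::brm(
  volume ~ log_price,
  data = dat01,
  family = poisson(),
  prior = c(
    brms::prior(normal(0, 5), class = Intercept),
    brms::prior(cauchy(0, 10), class = b, ub = 0) # keep elasticity negative
  ),
  file = "fits/windsor_01"
)

summary(windsor_01) # summarize the draws
pairs(windsor_01)   # pairs plot of the posterior
```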
In a Poisson model, the mean and variance of the dependent variable are equal. This is clearly not the case here (check this). So we might not expect the Poisson generating process to be a good fit.
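A quick check of that claim, assuming the volume column from the data above:

```r
# sample mean and variance of the outcome; these are equal under a Poisson model
dat |>
  dplyr::summarise(mean_volume = mean(volume), var_volume = var(volume))
```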
An alternative discrete pmf to the Poisson data generation process is the negative binomial process.
- build and fit a Bayesian Negative Binomial model (family = negbinomial()) using the same priors as in the Poisson model (see the sketch after this list)
- assign the fit to the variable windsor_02 and save the fit in the folder “fits/windsor_02”
- once the model is fit, summarize the draws and show the pairs plot
- comment on the posterior distributions in the pairs plot
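A sketch, mirroring the Poisson fit above:

```r
windsor_02 <- brms::brm(
  volume ~ log_price,
  data = dat01,
  family = brms::negbinomial(), # adds a shape parameter for overdispersion
  prior = c(
    brms::prior(normal(0, 5), class = Intercept),
    brms::prior(cauchy(0, 10), class = b, ub = 0)
  ),
  file = "fits/windsor_02"
)

summary(windsor_02)
pairs(windsor_02)
```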
Since we have discrete outcomes, the continuous distribution that is the default output of brms::pp_check is not appropriate here. Instead use the type = “rootogram” argument of brms::pp_check to plot posterior predictive checks for the Poisson and NB models.
Rootograms graphically compare frequencies of empirical distributions and expected (fitted) probability models. For the observed distribution the histogram is drawn on a square root scale (hence the name) and superimposed with a line for the expected frequencies.
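For example:

```r
# posterior predictive rootograms for the two models
brms::pp_check(windsor_01, type = "rootogram")
brms::pp_check(windsor_02, type = "rootogram")
```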
Finally, compare the two model fits using brms::loo and brms::loo_compare, and comment on the model comparison based on all the analyses performed in this exercise.
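A sketch of the comparison:

```r
loo_01 <- brms::loo(windsor_01)
loo_02 <- brms::loo(windsor_02)

# models are ordered from best to worst expected predictive fit
brms::loo_compare(loo_01, loo_02)
```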
Exercise 3:
Consider a complicated manufacturing process as follows, where production is one unit per day on average and manufacturing equipment maintenance takes 20% of the time on average.
# define parameters
prob_maintenance <- 0.2 # 20% of days
rate_work <- 1 # average 1 unit per day

# sample one year of production
n <- 365

# simulate days of maintenance
set.seed(365)
maintenance <- rbinom(n, size = 1, prob = prob_maintenance)

# simulate units completed
y <- (1 - maintenance) * rpois(n, lambda = rate_work)

dat <-
  tibble::tibble(maintenance = factor(maintenance, levels = 1:0), y = y)

dat %>%
  ggplot(aes(x = y)) +
  geom_histogram(aes(fill = maintenance),
                 binwidth = 1, linewidth = 1/10, color = "grey92") +
  scale_fill_manual(values = ggthemes::canva_pal("Green fields")(4)[1:2]) +
  xlab("Units completed") +
  theme(legend.position = "none")
With this data, the likelihood of observing zero on y (i.e., the likelihood zero units were completed on a given day) is

$$\Pr(0 \mid p, \lambda) = p + (1 - p)\,e^{-\lambda}$$

And the likelihood of a non-zero value $y$ is:

$$\Pr(y \mid y > 0,\ p, \lambda) = (1 - p)\,\frac{\lambda^{y} e^{-\lambda}}{y!}$$

So letting $p$ be the probability that $y$ is zero and $\lambda$ be the shape of the distribution, the zero-inflated Poisson (ZIPoisson) regression model might take the basic form:

$$\begin{aligned}
y_i &\sim \operatorname{ZIPoisson}(p_i, \lambda_i) \\
\operatorname{logit}(p_i) &= \alpha_p + \beta_p x_i \\
\log(\lambda_i) &= \alpha_\lambda + \beta_\lambda x_i
\end{aligned}$$

where both parameters in the likelihood, $p_i$ and $\lambda_i$, might get their own statistical model. In brms, $p_i$ is denoted by zi.
Create a Bayesian zero-inflated Poisson model (family = zero_inflated_poisson) with a normal(1, 0.5) prior on the intercept and a beta(2, 6) prior on zi (class = zi), and save the model in “fits/zi_model” (a sketch follows this list); then

- use print() to display a summary of the model, and
- noting the link functions in the second line of the model summary, express the parameters as probabilities.
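A sketch of the fit; an intercept-only formula is assumed here, since only intercept and zi priors are specified:

```r
zi_model <- brms::brm(
  y ~ 1,
  data = dat,
  family = brms::zero_inflated_poisson(),
  prior = c(
    brms::prior(normal(1, 0.5), class = Intercept),
    brms::prior(beta(2, 6), class = zi)
  ),
  file = "fits/zi_model"
)

print(zi_model)
# the Intercept is on the log scale, so exp(Intercept) recovers the Poisson rate
```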
Exercise 4:
Load the regional sales data below
dat <- readr::read_csv("data/regional_sales.csv", show_col_types = FALSE)
Build the following sales model. How are the terms (1|region_id) and (1|store_id) referred to in the output, and how do you interpret them?
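Since the model uses the terms (1|region_id) and (1|store_id), a plausible sketch might look like the following; the outcome name sales and the fixed-effect columns are assumptions taken from the new-store data described below:

```r
# assumed formula: fixed effects plus region- and store-level intercepts
sales_model <- brms::brm(
  sales ~ store_size + competition_density + marketing_spend +
    seasonal_effect + (1 | region_id) + (1 | store_id),
  data = dat,
  file = "fits/sales_model"
)

summary(sales_model)
```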
Create a tibble and bind it to the variable new_store_data. Give it the following columns:
- store_size = 5000
- competition_density = 5
- marketing_spend = 10000
- seasonal_effect = 0
- region_id = 1
- store_id = max(dat$store_id) + 1
Use the model and this data to predict sales for the new store and give the confidence intervals for the prediction.
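A sketch of the prediction step, reusing the assumed sales_model from above (allow_new_levels is needed because the new store_id was not in the training data):

```r
new_store_data <- tibble::tibble(
  store_size = 5000,
  competition_density = 5,
  marketing_spend = 10000,
  seasonal_effect = 0,
  region_id = 1,
  store_id = max(dat$store_id) + 1
)

# posterior predictive mean with 95% prediction intervals
predict(sales_model, newdata = new_store_data, allow_new_levels = TRUE)
```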
You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 10!”, and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.
I will pull (copy) everyone’s repository submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work!)
Grading
Total points available: 30 points.
| Component | Points |
|-----------|--------|
| Ex 1 - 4  | 30     |