Last week we looked at causal effect estimation methods in greater depth, including
inverse probability weighting
regression adjustment
doubly robust estimation
matching estimation
Difference in Differences (DiD)
Panel Data and fixed effects (FE)
This week
We will explore Monte Carlo methods as a way to integrate difficult functions, and to sample from difficult probability distributions.
Along the way we will look at Markov Chains, which can both underlie sampling methods and provide a way to model the generation of data.
Finally, we look at Markov Chain Monte Carlo (MCMC) methods to sample from probability distributions.
Monte Carlo (MC) Methods
Monte Carlo methods are a class of simulation-based methods that seek to avoid complicated and/or intractable mathematical computations, especially those that arise from probability distributions.
First we consider how ‘random numbers’ are generated by a computer.
Random Number Generation
Random numbers from a given distribution can be drawn from \([0,1]\) uniform random numbers via the inverse transform sampling method, as follows:
Uniform Distribution: Start by drawing a random number (\(U\)) from a (\([0,1]\)) uniform distribution. This gives us a random variable that is uniformly distributed between \(0\) and \(1\).
Cumulative Distribution Function (CDF): Every probability distribution has a CDF (\(F_X(x)\)), which gives the probability that a random variable (\(X\)) is less than or equal to (\(x\)).
Inverse CDF: The key idea is that if (\(U\)) is uniformly distributed in (\([0,1]\)), then (\(F_X^{-1}(U)\)) (the inverse of the CDF) will give us a random sample from the distribution of (\(X\)).
Random Number Generation
Process:
Draw a random number (\(U\)) from the uniform (\([0,1]\)) distribution.
Use the inverse CDF of the desired distribution: (\(X = F_X^{-1}(U)\)).
The result (X) is a random number that follows the target distribution.
Example: For an exponential distribution with rate parameter (\(\lambda\)), the CDF is \(F_X(x) = 1 - e^{-\lambda x}\). To generate a random draw from this distribution: Draw \(U \sim \mathcal{U}(0,1)\), then use the inverse of the CDF: (\(X = -\frac{1}{\lambda} \ln(1 - U)\)), to get an exponentially distributed random variable.
This method applies broadly to any distribution where you can compute or approximate the inverse CDF.
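As a quick illustration of inverse transform sampling for the exponential case above, a minimal R sketch (the rate \(\lambda = 2\) and the sample size are arbitrary choices):

set.seed(8740)
lambda <- 2
U <- runif(10000)                 # uniform draws on [0, 1]
X <- -log(1 - U) / lambda         # apply the inverse CDF of the exponential
c(mean(X), 1 / lambda)            # the sample mean should be close to 1/lambda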
Random Number Generation
Pseudo-random generators found in computers use a deterministic (i.e. repeatable) algorithm to generate a sequence of (apparently) random numbers on the \((0, 1)\) interval.
What defines a good random number generator (RNG) is a long period – how long it takes before the sequence repeats itself. A period of \(2^{32}\) is not enough (need at least \(2^{40}\)). Various statistical tests exist to measure “randomness” – well validated software will have gone through these checks.
Random Number Generation
Alternatives to inverse transform sampling are special algorithms that generate random samples from particular distributions; e.g. one way to sample from a Gaussian uses the Box-Muller method (a sketch follows the steps below):
generate two independent uniform \([0, 1]\) random variables, and
use some trigonometry.
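A minimal sketch of the Box-Muller transform in R (for illustration only; in practice one would simply call rnorm()):

set.seed(8740)
u1 <- runif(10000)                              # first uniform
u2 <- runif(10000)                              # second, independent uniform
z1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)     # standard Normal draw
z2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)     # second draw, independent of z1
c(mean(z1), var(z1))                            # roughly 0 and 1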
Very important: never write your own generator, always use a well validated generator from a reputable source (e.g. R).
Random Number Generation
Recall that \(\mathscr{N}(0, 1)\) Normal random variables (mean 0, variance 1) have the probability density function:
\[
p(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\equiv\phi(x)
\] and if \(X\sim\mathscr{N}(0, 1)\) then its CDF is \(\Phi(x)=\int_{-\infty}^{x}\phi(t)\,dt\).
Normal y1 vs Normal y2; independent random RVs
Random Number Generation
Your computer is only capable of producing pseudorandom numbers. These are made by running a pseudorandom number generator algorithm, which is deterministic.
Once the RNG seed is set, the “random” numbers that R generates aren’t random at all. But someone looking at these random numbers would have a very hard time distinguishing these numbers from truly random numbers.
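For example (the seed value is arbitrary):

set.seed(8740)
runif(3)          # three "random" numbers
set.seed(8740)
runif(3)          # resetting the seed reproduces exactly the same numbers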
MC Methods
We can use MC methods to estimate probabilities: for a random variable \(Z\) with outcomes in a set \(\Omega\), given a subset \(S\subset\Omega\) and defining an event \(E\equiv Z\in S\), we can compute the probability of \(E\) (i.e. \(\mathbb{P}(E)\)) with \(M\) samples of \(Z\), say \(z_1,z_2,\ldots,z_M\), as the fraction of samples that fall in \(S\):
\[
\mathbb{P}(E) \approx \frac{1}{M}\sum_{i=1}^M \mathbb{1}\{z_i \in S\}.
\]
If \(Z\sim\mathscr{N}(1,3)\) and \(S=\{z: 0\le z\le 3\}\) then
\[
\mathbb{P}(E) = \mathbb{P}(0\le Z\le 3) = \int_0^3\frac{1}{\sqrt{2\pi\cdot 3}}e^{-\frac{(t-1)^2}{2\cdot 3}}dt
\] In which case it is easier to just use R and calculate:
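A sketch of what that calculation might look like, comparing the exact answer from the Normal CDF with a Monte Carlo estimate (the sample size is arbitrary):

# exact: the variance is 3, so sd = sqrt(3)
pnorm(3, mean = 1, sd = sqrt(3)) - pnorm(0, mean = 1, sd = sqrt(3))

# Monte Carlo: the proportion of samples that land in [0, 3]
set.seed(8740)
z <- rnorm(1e5, mean = 1, sd = sqrt(3))
mean(z >= 0 & z <= 3)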
More generally, the variance of an MC estimate (e.g. of \(\mathbb{E}[h(Z)]\)) based on \(M\) samples decreases as \(1/M\).
Moments and the MGF
The \(k\)-th moment of a random variable \(X\) is the expected value of \(X\) raised to the \(k\)-th power:
\[
\acute{\mu}_k(X)=\mathbb{E}[X^k]
\] which is known as the \(k\)-th raw moment.
The \(k\)-th central moment of a random variable \(X\) is the \(k\)-th moment of \(X\) about its mean \(\mu = \mathbb{E}[X]\), i.e. the \(k\)-th raw moment of \(X-\mu\):
\[
\mu_k(X)=\mathbb{E}[(X-\mu)^k]
\]
The \(k\)-th standardized moment of a random variable \(X\) is the \(k\)-th central moment of \(X\) divided by the \(k\)-th power of the standard deviation of \(X\).
Recall the moment generating function (MGF) of a random variable \(X\), \(M_X(t)=\mathbb{E}[e^{tX}]\). For the CLT, let \(Z_1,Z_2,\ldots,Z_n\) be IID standardized (mean \(0\), variance \(1\)) random variables and let \(\bar{Z}_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i}\). Then
\[
\begin{align*}
M_{\bar{Z}_{n}}(t) & =\left(M_{Z/\sqrt{n}}(t)\right)^{n}=\left(M_{Z}(t/\sqrt{n})\right)^{n}\\
M_{Z}(t/\sqrt{n}) & =1+0+\frac{\left(t/\sqrt{n}\right)^{2}}{2!}+\sum_{k=3}^{\infty}M_{Z}^{(k)}(0)\frac{\left(t/\sqrt{n}\right)^{k}}{k!}
\end{align*}
\] Now \(M_{Z}(t/\sqrt{n}) = T_2(t/\sqrt{n}) + R_2(t/\sqrt{n})\) where \(T_2\) is the Taylor series up to order 2, and \(R_2\) is the residual - the Taylor series terms greater than order 2.
Central Limit Theorem
\[
\lim_{n\rightarrow\infty}\frac{R_2(t/\sqrt{n})}{(t/\sqrt{n})^2}=\lim_{n\rightarrow\infty}nR_2(t/\sqrt{n})=0
\] and so
\[
\lim_{n\rightarrow\infty}M_{\bar{Z}_{n}}(t)=\lim_{n\rightarrow\infty}\left(1+\frac{t^{2}}{2n}+R_{2}(t/\sqrt{n})\right)^{n}=e^{t^{2}/2},
\]
the MGF of a \(\mathscr{N}(0,1)\) random variable; i.e. the standardized sum of \(n\) IID samples of a random variable converges in distribution to a Gaussian as \(n\rightarrow\infty\).
Central Limit Theorem
The CLT says that for \(f=\mathbb{P}(E)\), if \(\sigma^2\equiv\mathrm{Var}\left(f\right)\) is finite then the error of the MC estimate based on \(M\) samples,
\[
e_M(f)=\bar{f}-\mathbb{E}[f],
\]
is approximately Normal in distribution for large \(M\), i.e.
\[
e_M(f)\sim\sigma M^{-1/2}Z
\] where \(Z\sim\mathscr{N}(0,1)\).
MC Methods
Suppose we need to compute an expectation \(\mathbb{E}[h(Z)]\) for some random variable \(Z\) and some function \(h:\mathbb{R}\to\mathbb{R}\).
Monte Carlo methods avoid doing any integration and instead just generate many samples of \(Z\), say \(z_1,z_2,\ldots,z_M\), and estimate \(\mathbb{E}[h(Z)]\) as \(\frac{1}{M}\sum_{i=1}^Mh(z_i)\), relying on empirical probability. The law of large numbers states that this sample mean should be close to \(\mathbb{E}[h(Z)]\).
Monte Carlo replaces the work of computing an integral (i.e., an expectation) with the work of generating many random variables.
MC Methods
sampling
If \(X\sim\mathscr{N}(\mu,\sigma^2)\) and we want to compute \(\mathbb{E}[\log|X|]\), we could set up and solve the integral
\[
\mathbb{E}[\log|X|]=\int_{-\infty}^{\infty}\log|x|\,\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx .
\]
Alternatively, we could just draw lots of Monte Carlo replicates \(X_1,X_2,\cdots,X_M\) from a normal with mean \(\mu\) and variance \(\sigma^2\), and look at the sample mean \(M^{-1}\sum_{i=1}^M\log|x_i|\), once again appealing to the law of large numbers to ensure that this sample mean is close to its expectation.
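A minimal sketch of that Monte Carlo estimate (the values of \(\mu\), \(\sigma\) and the number of replicates are illustrative):

set.seed(8740)
mu <- 1; sigma <- 2
x <- rnorm(1e5, mean = mu, sd = sigma)   # Monte Carlo replicates of X
mean(log(abs(x)))                        # estimate of E[log|X|]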
MC Methods
sampling
Suppose that we want to compute an integral
\[
\int_Dg_0(x)dx
\] where \(D\) is some domain of integration and \(g_0(.)\) is a function.
Let \(f(x)\) be the density of some random variable with \(f(x)>0, \forall x\in D\). Then we can rewrite the integral as
\[
\int_D g_0(x)\,dx = \int_D \frac{g_0(x)}{f(x)}\,f(x)\,dx = \mathbb{E}_f\left[\frac{g_0(X)}{f(X)}\right],
\]
which we can estimate with a Monte Carlo average of \(g_0(X)/f(X)\) over draws \(X\sim f\).
Similarly, if we are given \(h(x)\) and we want to compute \(\mathbb{E}_f[h(X)]\) where \(X\sim f,\,x\in D\). What if we could not sample from \(f\) directly?
If there were some other distribution \(g_1(x)\) we could sample from, such that \(g_1(x)>0,\,x\in D\), then
\[
\mathbb{E}_f[h(X)] = \int_D h(x)\,\frac{f(x)}{g_1(x)}\,g_1(x)\,dx = \mathbb{E}_{g_1}\left[h(X)\,\frac{f(X)}{g_1(X)}\right].
\]
This is importance sampling: the ratio \(f(x)/g_1(x)\) acts as a weight on each sample from \(g_1\).
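A minimal sketch of this importance sampling idea, taking \(f\) to be a standard Normal target, \(h(x)=x^2\), and \(g_1\) a Student-t proposal (all illustrative choices; the true value of \(\mathbb{E}_f[X^2]\) is \(1\)):

set.seed(8740)
h <- function(x) x^2
M <- 1e5
x <- rt(M, df = 5)               # draws from the proposal g1 (t with 5 df)
w <- dnorm(x) / dt(x, df = 5)    # importance weights f(x) / g1(x)
mean(h(x) * w)                   # estimate of E_f[h(X)], should be close to 1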
Assume we have an un-normalized \(g(x)\), i.e. \(\pi(x)=cg(x)\) but \(c\) is unknown.
We want to generate iid samples \(x_1,x_2,\ldots,x_M\sim \pi\) to estimate \(\mathbb{E}_\pi[h(X)]\).
Now assume we have an easily sampled density \(f\), and known \(K>0\), such that \(Kf(x)\ge g(x),\;\forall x\), i.e. \(Kf(x)\ge \pi(x)/c\) (or \(cKf(x)\ge \pi(x)\)).
MC Methods
rejection sampling
Then use the following procedure (a sketch in R follows the steps):
sample \(X\sim f\) and \(U\sim \mathrm{uniform}[0,1]\)
if \(U\le\frac{g(x)}{Kf(x)}\), then accept \(x\) as a draw from \(\pi\)
otherwise reject the sample and repeat
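A minimal sketch in R, using the un-normalized target \(g(x) = x(1-x)\) on \([0,1]\) (a Beta(2,2) up to a constant), the Uniform\([0,1]\) proposal \(f\), and \(K = 1/4\) (all illustrative choices):

set.seed(8740)
g <- function(x) x * (1 - x)        # un-normalized target on [0, 1]
K <- 1 / 4                          # K * f(x) >= g(x) since f(x) = 1 and max g(x) = 1/4
x <- runif(1e5)                     # candidate draws from the proposal f
u <- runif(1e5)
accepted <- x[u <= g(x) / (K * 1)]  # keep candidates with U <= g(x) / (K f(x))
c(mean(accepted), 0.5)              # Beta(2, 2) has mean 1/2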
MC Methods
rejection sampling
Since \(0\le\frac{g(x)}{Kf(x)}\le 1\) we know that \(\mathbb{P}\left(U\le\frac{g(X)}{Kf(X)}|X=x\right)=\frac{g(x)}{Kf(x)}\)
and so we have the joint probability (above - \(\mathbb{P}(A,B)\)), and the probability of acceptance (previous slide - \(\mathbb{P}(B)\)), so the probability, conditional on acceptance (\(\mathbb{P}(A|B)\)) is \(\frac{\mathbb{P}(A,B)}{\mathbb{P}(B)}\) by Bayes rule.
A stochastic process, which we will usually write as \((X_n)\), is an indexed sequence of random variables that are possibly dependent on each other.
Each random variable \(X_n\) takes a value in a state space \(\mathcal S\) which is the set of possible values for the process. As with usual random variables, the state space \(\mathcal S\) can be discrete or continuous.
A discrete state space denotes a set of distinct possible outcomes, which can be finite or countably infinite. For example, \(\mathcal S = \{\text{Heads},\text{Tails}\}\) is the state space for a single coin flip, while in the case of counting insurance claims, the state space would be the nonnegative integers \(\mathcal S = \mathbb Z_+ = \{0,1,2,\dots\}\).
Stochastic processes
The stochastic process has an index set that orders the random variables that make up the process.
The index set is usually interpreted as a time variable, telling us when the process will be measured. The index set for time can also be discrete or continuous. Discrete time denotes a process sampled at distinct points, often denoted by \(n = 0,1,2,\dots\), while continuous time denotes a process monitored constantly over time, often denoted by \(t \in \mathbb R_+ = \{x \in \mathbb R : x \geq 0\}\).
Markov property
Consider a simple board game where we roll a die and move that many squares forward on the board. Suppose we are currently on the square \(X_n\). What can we say about which square \(X_{n+1}\) we move to on our next turn?
\(X_{n+1}\) is random, since it depends on the roll of the die.
\(X_{n+1}\) depends on where we are now, \(X_n\), since the die's score will be added to the number of our current square.
Given the square \(X_n\) we are now, \(X_{n+1}\) doesn’t depend on which sequence of squares \(X_0, X_1, \dots, X_{n-1}\) we used to get here.
Markov property
The third point is called the Markov property or memoryless property.
The key property of Markov chains is memorylessness: to get the next value we only need to know where we are now, not which squares we visited to get here.
The past and the future are conditionally independent given the present.
Markov Chains
Consider the following simple random walk on the integers \(\mathbb Z\):
We start at \(0\), then at each time step, we go up by one with probability \(p\) and down by one with probability \(q = 1-p\). When \(p = q = \frac12\), we’re equally as likely to go up as down, and we call this the simple symmetric random walk.
The simple random walk is a simple but very useful model for lots of processes, like stock prices, sizes of populations, or positions of gas particles. (In many modern models, however, these have been replaced by more complicated continuous time and space models.) The simple random walk is sometimes called the “drunkard’s walk.”
We can write this as a stochastic process \((X_n)\) with discrete time \(n \in \{0,1,2,\dots\} = \mathbb Z_+\) and discrete state space \(\mathcal S = \mathbb Z\), where \(X_0 = 0\) and, for \(n \geq 0\), we have
\[
X_{n+1} = \begin{cases} X_n + 1 & \text{with probability } p \\ X_n - 1 & \text{with probability } q = 1-p. \end{cases}
\]
It’s clear from this definition that \(X_{n+1}\) (the future) depends on \(X_n\) (the present), but, given \(X_n\), does not depend on \(X_{n-1}, \dots, X_1, X_0\) (the past). Thus the Markov property holds, and the simple random walk is a discrete time Markov process or Markov chain.
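A minimal sketch simulating a few steps of the simple random walk (the value of p, the number of steps and the seed are arbitrary):

set.seed(8740)
p <- 0.5                                                        # symmetric walk
steps <- sample(c(1, -1), 100, replace = TRUE, prob = c(p, 1 - p))
X <- c(0, cumsum(steps))                                        # X_0 = 0, then add +/-1 at each step
head(X, 10)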
Markov Chains
To define a Markov chain, we first need to say where we start from, and then what the probabilities of transitions from one state to another are.
In our examples of the simple random walk and gambler’s ruin, we specified the start point \(X_0 = i\) exactly, but we could also pick the start point at random according to some distribution \(\pi_i = \mathbb P(X_0 = i)\).
Markov Chains
We also need to know the transition probabilities \(\mathbb P(X_{n+1} = j \mid X_n = i)\) for \(i,j \in \mathcal S\). Because of the Markov property, the transition probability only needs to condition on the state we’re in now \(X_n = i\), and not on the whole history of the process.
In the case of the simple random walk, for example, we had initial distribution \(\pi_0 = \mathbb P(X_0 = 0) = 1\), i.e. we always start from \(0\).
For the random walk (and also the gambler’s ruin), the transition probabilities \(\mathbb P(X_{n+1} = j \mid X_n = i)\) don’t depend on \(n\); in other words, the transition probabilities stay the same over time.
A Markov process with this property is called time homogeneous. We will only consider time homogeneous processes in this course.
Markov Chains
Write \(p_{ij} = \mathbb P(X_{n+1} = j \mid X_n = i)\) for the transition probabilities, which are independent of \(n\). We must have \(p_{ij} \geq 0\), since it is a probability, and we must also have \(\sum_j p_{ij} = 1\) for all states \(i\), as this is the sum of the probabilities of all the places you can move to from state i.
When the state space is finite (and even sometimes when it’s not), it’s convenient to write the transition probabilities \((p_{ij})\) as a matrix \(\mathsf P\), called the transition matrix, whose \((i,j)\)th entry is \(p_{ij}\).
Then the condition that \(\sum_j p_{ij} = 1\) is the condition that each of the rows of \(\mathsf P\) add up to \(1\).
Markov Chains
Consider a simple two-state Markov chain with state space \(\mathcal S = \{0,1\}\) and transition matrix \[ \mathsf P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} \] for some \(0 < \alpha < 1\), \(0 < \beta < 1\). Note that the rows of \(\mathsf P\) add up to \(1\), as they must.
We can illustrate \(\mathsf P\) by a transition diagram/graph, where the nodes/vertices are the states and the arrows give the transition probabilities. (We don’t draw the arrow if \(p_{ij} = 0\).) In this case, our transition diagram looks like the graph on the next slide:
Markov Chains
Transition diagram for the two-state Markov chain
Markov Chains
We can use this graph as a simple model of customer churn, for example. If the customer has closed their account (state 0) on one period, then with probability \(\alpha\) we will be able to entice them to open their account again (state 1) by the next period; while if the customer has an account (state 1), then with probability \(\beta\) they will have closed their account (state 0) by the next period.
Markov Chains
If the customer has an account in period \(n\), what’s the probability they also have an account in period \(n+2\)?
\[
\begin{align*}
\mathbb P(X_{n+2} = 1 \mid X_n = 1) & = \sum_{k\in\{0,1\}} \mathbb P(X_{n+2} = 1 \mid X_{n+1} = k, X_n = 1)\,\mathbb P(X_{n+1} = k \mid X_n = 1)\\
 & = \sum_{k\in\{0,1\}} \mathbb P(X_{n+2} = 1 \mid X_{n+1} = k)\,\mathbb P(X_{n+1} = k \mid X_n = 1)\\
 & = p_{10}\,p_{01} + p_{11}\,p_{11} = \beta\alpha + (1-\beta)^2 .
\end{align*}
\]
In the second equality, we used the Markov property: conditional probabilities like \(\mathbb P(X_{n+2} = 1 \mid X_{n+1} = k)\) do not need to depend on \(X_n\).
Another way to think of this is that we are summing the probabilities of all length-2 paths from 1 to 1, which are \(1\to 0\to 1\) with probability \(\beta\alpha\) and \(1 \to 1 \to 1\) with probability \((1-\beta)^2\).
Markov Chains
In the above example, we calculated a two-step transition probability:
\[
p_{11}(2) = \mathbb P(X_{n+2} = 1 \mid X_n = 1) = p_{10}\,p_{01} + p_{11}\,p_{11} = \sum_{k} p_{1k}\,p_{k1}.
\]
But this is exactly the formula for multiplying the matrix \(\mathsf P\) with itself! In other words, \(p_{ij}(2) = \sum_{k} p_{ik}p_{kj}\) is the \((i,j)\)th entry of the matrix square \(\mathsf P^2 = \mathsf{P}\times\mathsf{P}\). If we write \(\mathsf P(2) = (p_{ij}(2))\) for the matrix of two-step transition probabilities, we have \(\mathsf P(2) = \mathsf P^2\).
More generally, we see that this rule holds over multiple steps, provided we sum over all the possible paths \(i\to k_1 \to k_2 \to \cdots \to k_{n-1} \to j\) of length \(n\) from \(i\) to \(j\).
Markov Chains
Theorem 1 Let \((X_n)\) be a Markov chain with state space \(\mathcal S\) and transition matrix \(\mathsf P = (p_{ij})\). For \(i,j \in \mathcal S\), write \[ p_{ij}(n) = \mathbb P(X_n = j \mid X_0 = i) \] for the \(n\)-step transition probability. Then
\[
p_{ij}(n) = \sum_{k_1,k_2,\dots,k_{n-1} \in \mathcal S} p_{ik_1}\,p_{k_1 k_2}\cdots p_{k_{n-1}j}.
\]
In particular, \(p_{ij}(n)\) is the \((i,j)\)th element of the matrix power \(\mathsf P^n\), and the matrix of \(n\)-step transition probabilities is given by \(\mathsf P(n) = \mathsf P^n\).
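For the two-state chain above, a minimal sketch (the values \(\alpha = 0.2\), \(\beta = 0.4\) are illustrative) checking that the matrix square reproduces the path-counting answer:

alpha <- 0.2; beta <- 0.4
P <- matrix(c(1 - alpha, alpha,
              beta,      1 - beta), nrow = 2, byrow = TRUE)
P2 <- P %*% P                    # two-step transition probabilities P(2) = P^2
P2[2, 2]                         # P(X_{n+2} = 1 | X_n = 1); row/column 2 corresponds to state 1
beta * alpha + (1 - beta)^2      # path-counting answer from the earlier slide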
Markov Chains
The so-called Chapman–Kolmogorov equations follow immediately from this.
Let \((X_n)\) be a Markov chain with state space \(\mathcal S\) and transition matrix \(\mathsf P = (p_{ij})\). Then, for non-negative integers \(n,m\), we have \[ p_{ij}(n+m) = \sum_{k \in \mathcal S} p_{ik}(n)p_{kj}(m) , \] or, in matrix notation, \(\mathsf P(n+m) = \mathsf P(n)\mathsf P(m)\).
Markov Chains
In other words, a trip of length \(n + m\) from \(i\) to \(j\) is a trip of length \(n\) from \(i\) to some other state \(k\), then a trip of length \(m\) from \(k\) back to \(j\), and this intermediate stop \(k\) can be any state, so we have to sum the probabilities.
Of course, once we know that \(\mathsf P(n) = \mathsf P^n\) is given by the matrix power, it’s clear to see that \(\mathsf P(n+m) = \mathsf P^{n+m} = \mathsf P^n \mathsf P^m = \mathsf P(n)\mathsf P(m)\).
Markov Chains
initial distribution
If we start from a state given by a pmf \(\boldsymbol \pi_0 = (\pi_{0,i})_{i\in S}\), then after step 1 the probability we’re in state \(j\) is \(\sum_i \pi_{0,i} p_{ij}\).
Given transition matrix \(\mathsf P\) and initial state pmf \(\boldsymbol \pi_0\), the pmf for the state after one step is \(\boldsymbol \pi_1=\boldsymbol \pi_0 \times\mathsf P\), i.e. \(\boldsymbol \pi_1\) and \(\boldsymbol \pi_0\) are row vectors, and for any given state pmf, the pmf for the states on the next step of the Markov chain is the current state pmf multiplied by the transition matrix.
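A minimal sketch of one-step updating of the state pmf for the two-state chain (\(\alpha = 0.2\), \(\beta = 0.4\) and the initial pmf are illustrative):

alpha <- 0.2; beta <- 0.4
P <- matrix(c(1 - alpha, alpha,
              beta,      1 - beta), nrow = 2, byrow = TRUE)
pi0 <- c(1, 0)        # start in state 0 with certainty (a row vector)
pi1 <- pi0 %*% P      # pmf after one step
pi2 <- pi1 %*% P      # pmf after two steps
pi1; pi2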
Markov Chains
Starting from a state given by a distribution \(\boldsymbol \pi = (\pi_i)_{i\in S}\), then after step 1 the probability we’re in state \(j\) is \(\sum_i \pi_i p_{ij}\). So if \(\pi_j = \sum_i \pi_i p_{ij}\), we stay in this distribution forever. We call such a distribution a stationary distribution. We again recognise this formula as a matrix-vector multiplication, so this is \(\boldsymbol \pi = \boldsymbol \pi\mathsf P\), where \(\boldsymbol \pi\) is a row vector.
Let \((X_n)\) be a Markov chain on a state space \(\mathcal S\) with transition matrix \(\mathsf P\). Let \(\boldsymbol \pi = (\pi_i)\) be a distribution on \(\mathcal S\), in that \(\pi_i \geq 0\) for all \(i \in \mathcal S\) and \(\sum_{i \in \mathcal S} \pi_i = 1\). We call \(\boldsymbol \pi\) a stationary distribution if
\[
\pi_j = \sum_{i \in \mathcal S} \pi_i p_{ij} \quad \text{for all } j \in \mathcal S,
\]
or, equivalently, if \(\boldsymbol \pi = \boldsymbol \pi\mathsf P\).
Markov Chains
Note that we’re saying the distribution \(\mathbb P(X_n = i)\) stays the same; the Markov chain \((X_n)\) itself will keep moving.
One way to think is that if we started off a thousand Markov chains, choosing each starting position to be \(i\) with probability \(\pi_i\), then (roughly) \(1000 \pi_j\) of them would be in state \(j\) at any time in the future – but not necessarily the same ones each time.
Markov Chains
stationary distributions
So our definition of the stationary state pmf is: \(\boldsymbol\pi\) such that \(\boldsymbol\pi=\boldsymbol\pi\times \mathsf P\). We can calculate it as follows: rewrite this as \(\boldsymbol\pi\left(\mathsf I-\mathsf P\right)=\boldsymbol 0\), or, transposing, \(\left(\mathsf I-\mathsf{P}\right)^{\top}\boldsymbol\pi^{\top}=\boldsymbol 0\),
along with the additional constraint \(\sum_i\pi_i=1\).
If we add a row of ones at the bottom of our matrix \(\left(I-\mathsf{P}\right)^{\top}\) and a \(1\) at the bottom of the vector on the right we can solve for \(\pi\), e.g. using qr.solve.
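A minimal sketch for the two-state chain (\(\alpha = 0.2\), \(\beta = 0.4\) illustrative); the analytic answer \(\boldsymbol\pi = (\beta, \alpha)/(\alpha+\beta)\) is shown for comparison:

alpha <- 0.2; beta <- 0.4
P <- matrix(c(1 - alpha, alpha,
              beta,      1 - beta), nrow = 2, byrow = TRUE)
A <- rbind(t(diag(2) - P), rep(1, 2))   # (I - P)^T with a row of ones appended
b <- c(0, 0, 1)                         # zeros, plus the constraint sum(pi) = 1
qr.solve(A, b)                          # least-squares solution for pi
c(beta, alpha) / (alpha + beta)         # analytic stationary distribution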
A Markov Chain is irreducible if you have positive probability of eventually getting to anywhere in the state space from anywhere else in the state space.
A Markov Chain is aperiodic if there are no forced cycles, i.e. there do not exist disjoint non-empty subsets \(X_1,X_2,\ldots,X_d\) for \(d\ge 2\), such that \(P(x,X_{i+1})=1\) for all \(x\in X_i \;(1\le i\le d-1)\), and \(P(x,X_1)=1\) for all \(x\in X_d\).
Markov Chains
MC with cycle
Markov Chain Monte Carlo
Suppose we have a complicated, high-dimensional density \(\pi = cg\) and we want samples \(X_1, X_2,\dots \sim \pi\).
Define a Markov chain \(X_0, X_1,X_2\dots\) in such a way that for large enough \(m\), we have \(X_n\sim \pi,\,\forall n\ge m\).
Then we can estimate \(\mathbb{E}_{\pi}(h) \equiv \int h(x)\, \pi(x)\, dx\) by the average of \(h\) over the draws after step \(m\):
\[
\widehat{\mathbb{E}_{\pi}(h)} = \frac{1}{M-m}\sum_{n=m+1}^{M} h(X_n).
\]
propose <- function(x) rnorm(1, x, 4)

step <- function(x, f, q) {
  ## Pick new point
  xp <- propose(x)
  ## Acceptance probability:
  alpha <- min(1, f(xp) / f(x))
  ## Accept new point with probability alpha:
  if (runif(1) < alpha) x <- xp
  ## Returning the point:
  x
}

run <- function(x, f, q, nsteps) {
  res <- matrix(NA, nsteps, length(x))
  for (i in seq_len(nsteps)) res[i, ] <- x <- step(x, f, q)
  drop(res)
}
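For example, this sampler could be run against an un-normalized target density (the standard Normal target, starting point and number of steps below are illustrative):

set.seed(8740)
target <- function(x) exp(-x^2 / 2)       # un-normalized N(0, 1) density
samples <- run(0, target, NULL, 5000)     # q is unused by this simple random-walk Metropolis sampler
c(mean(samples), sd(samples))             # roughly 0 and 1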
so \(X_1\sim\pi\) too, and the Markov chain preserves \(\pi\), i.e. \(\pi\) is a stationary distribution.
MCMC Metropolis Algo
example
\(\chi=\mathbb{Z}\), \(\pi(x)=2^{-|x|}/3\), and \(q(x,y)=\frac{1}{2}\) if \(|x-y|=1\), otherwise \(0\).
reversible? Yes, it’s a Metropolis algorithm
\(\pi\) stationary? Yes, follows from reversibility
aperiodic? Yes, since \(\mathsf{P}(x,\{x\})>0\)
irreducible? Yes, since \(\pi(x)>0,\forall x\in\chi\)
MCMC Metropolis-Hastings Algo
MCMC algorithms require that the Markov chain is reversible, \(\pi\)-stationary, aperiodic, and irreducible.
The Metropolis algorithm uses the symmetry of the proposal distribution to ensure reversibility and irreducibility.
If the proposal distribution were not symmetric we might not have those two properties; e.g. if \(q(x,y) \gg q(y,x)\) then the Metropolis chain would spend too much time at \(y\) and not enough at \(x\), so we should accept fewer of the moves \(x\rightarrow y\).
MCMC Metropolis-Hastings Algo
If we only require that \(q(x,y)>0\;\textrm{iff}\; q(y,x)>0\), then replace the Metropolis acceptance ratio \(\frac{\pi(y)}{\pi(x)}\) with \(\frac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)}\),
and the algorithm is still valid even if \(q\) is not symmetric.
MCMC Metropolis-Hastings Algo
Why is this valid?
For metropolis, the key was that the Markov chain was reversible, i.e. \(\pi(x)\mathsf{P}(x,y)=\pi(y)\mathsf{P}(y,x)\)
If instead \(A_n = \frac{\pi(Y_n)q(Y_n,X_{n-1})}{\pi(X_{n-1})q(X_{n-1},Y_n)}\), with acceptance probability \(\alpha(x,y)=\min\left[1,\frac{\pi(Y_n)q(Y_n,X_{n-1})}{\pi(X_{n-1})q(X_{n-1},Y_n)}\right]\), then
\(\pi(x)\mathsf{P}(x,y)\) is still symmetric, even if \(q\) wasn’t.
the chain is still reversible, so there is a stationary distribution \(\pi\).
if the chain is irreducible and aperiodic, then it converges to \(\pi\).
MCMC Metropolis-Hastings Algo
example: independence sampler
Here \(\{Y_n\}\sim q(\cdot)\), i.e. the \(\{Y_n\}\) are IID from some fixed density \(q\), independent of \(X_{n-1}\), and we accept \(Y_n\) if \(U_n<A_n\) where \(U_n\sim \textrm{Uniform}[0,1]\) and
\[
A_n = \frac{\pi(Y_n)\,q(X_{n-1})}{\pi(X_{n-1})\,q(Y_n)} .
\]
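A minimal sketch of an independence sampler, with a \(\mathscr{N}(0,1)\) target and a wider \(\mathscr{N}(0,2^2)\) proposal (both choices illustrative):

set.seed(8740)
target <- function(x) dnorm(x, 0, 1)     # target density pi
q <- function(x) dnorm(x, 0, 2)          # fixed proposal density, independent of the current state
M <- 5000
x <- numeric(M)
x[1] <- 0
for (n in 2:M) {
  y <- rnorm(1, 0, 2)                    # propose Y_n ~ q, ignoring x[n - 1]
  A <- (target(y) * q(x[n - 1])) / (target(x[n - 1]) * q(y))
  x[n] <- if (runif(1) < min(1, A)) y else x[n - 1]
}
c(mean(x), sd(x))                        # roughly 0 and 1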
MCMC and Bayesian Analysis
In Bayesian analysis we want to sample from the posterior
\[
\pi_\theta\left(\left.\theta\right|X\right) \propto \pi_X\left(\left.X\right|\theta\right)\,\pi_\theta\left(\theta\right), \tag{1}
\]
where the parameter set \(\theta\) depends on the assumed data generating process \(\pi_X\).
In words: the posterior probability of the parameters given the observed data is equal (to within a scaling constant) to the probability of the observed data given the parameters, times the prior probabilities of the parameters. In practice we refer to the probabilities as likelihoods, and use log-likelihoods in Equation 1 to avoid numerical problems arising from the product of small probabilities.
Initial state: Bayesian parameter estimation, Binomial process
set.seed(8740)
p <- 0.5
n <- 10
# range of probability [0.01 - 0.99]
p_values <- seq(0.01, 0.99, 0.001)
# prior probability is beta(7, 2)
pr <- dbeta(p_values, 7, 2)
# Have to normalize given discreteness
pr <- pr / sum(pr)
# create the data
dat <- tibble::tibble(parameters = p_values, prob = pr, x = 0, step = 0)
7 Bayesian updating steps
# Run for M samples
for (i in 1:8) {
  # have the data generating process generate a data point
  x <- rbinom(1, n, p)
  # multiply the likelihood of observing the data point at each p-value
  # by the prior probability of each p-value:
  # this gives the posterior probability for each p-value
  ps <- dbinom(x, n, p_values) * pr
  # normalize
  ps <- ps / sum(ps)
  # lines(ps~p_values, col=(i+1))
  # save the posterior p-value probabilities at this step
  dat <- dat |>
    dplyr::bind_rows(
      tibble::tibble(parameters = p_values, prob = ps, x = x) |>
        dplyr::mutate(step = i)
    )
  # update the prior probability of each p-value to be its posterior probability
  pr <- ps
}
Plots of Bayesian updating steps
dat <- dat |>
  dplyr::group_by(step) |>
  dplyr::mutate(
    title = dplyr::case_when(
      step == 0 ~ stringr::str_glue("Step {step}: prior mean is {(parameters*prob) |> sum() |> round(digits=3)}"),
      TRUE ~ stringr::str_glue("Step {step}: sample is {x} & posterior mean is {(parameters*prob) |> sum() |> round(digits=3)}")
    )
  )

labels <- dat |>
  dplyr::distinct(step, title) |>
  dplyr::mutate(step = factor(step))

step_labels <- split(labels$title, labels$step)
step_labeller <- function(value) {
  return(step_labels[value])
}

dat |>
  dplyr::mutate(step = factor(step)) |>
  dplyr::group_by(step) |>
  ggplot(aes(x = parameters, y = prob, color = step)) +
  geom_line() +
  geom_vline(xintercept = 0.5, color = "grey", linewidth = 1, linetype = "dashed") +
  facet_wrap(vars(step), nrow = 2, labeller = labeller(step = step_labeller)) +
  theme(plot.margin = unit(c(5, 1, 5, 1), "cm")) +
  theme_minimal() +
  labs(title = "Bayesian updating", subtitle = "Binomial data; n known, true p = 0.5")
MCMC and Bayesian Analysis
It is more common to have multiple data points and multiple parameters. In this case we sample from the posterior using \(\pi_X\left(\left.X\right|\theta\right)\times\pi_\theta\left(\theta\right)\) and an MCMC method like Metropolis-Hastings.
In the example to follow, we observe \((y,x)\) and we have parameters \(\theta=(a,b,\sigma)\) that we want to estimate
the data generation process is assumed to be \(y \sim \mathscr{N}(ax+b, \sigma^2)\)
set.seed(8740)
a <- 5; b <- 0; sigma <- 10; sampleSize <- 31
data <- tibble::tibble(
  x = (-(sampleSize - 1) / 2):((sampleSize - 1) / 2),
  # y = a*x + b + Normal noise, matching the assumed data generating process
  y = a * x + b + rnorm(n = sampleSize, mean = 0, sd = sigma)
)
prior <- function(param) {
  aprior <- dunif(param[1], min = 0, max = 10, log = TRUE)
  bprior <- dnorm(param[2], sd = 5, log = TRUE)
  sdprior <- dunif(param[3], min = 0, max = 30, log = TRUE)
  # return the log of the product of probabilities (parameters)
  return(aprior + bprior + sdprior)
}
likelihood <- function(param, x, y) {
  pred <- param[2] + param[1] * x
  singlelikelihoods <- dnorm(y, mean = pred, sd = param[3], log = TRUE)
  sumll <- sum(singlelikelihoods)
  # return the log of the product of probabilities (data)
  return(sumll)
}
proposalfunction <- function(param) {
  return(rnorm(3, mean = param, sd = c(0.1, 0.5, 0.3)))
}
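The pieces above can be combined into a Metropolis-Hastings loop along the following lines (a sketch: the names posterior and run_MH, the starting values, and the chain length are illustrative and not from the original code):

posterior <- function(param, x, y) {
  likelihood(param, x, y) + prior(param)        # log-posterior up to an additive constant
}

run_MH <- function(start, x, y, iterations) {
  chain <- matrix(NA_real_, nrow = iterations + 1, ncol = length(start))
  chain[1, ] <- start
  for (i in 1:iterations) {
    proposal <- proposalfunction(chain[i, ])
    # evaluate the prior first so proposals outside its support (e.g. sigma < 0) are rejected
    lp <- prior(proposal)
    log_alpha <- if (is.finite(lp)) lp + likelihood(proposal, x, y) - posterior(chain[i, ], x, y) else -Inf
    if (log(runif(1)) < log_alpha) {
      chain[i + 1, ] <- proposal                # accept the proposed move
    } else {
      chain[i + 1, ] <- chain[i, ]              # reject: stay at the current values
    }
  }
  chain
}

set.seed(8740)
chain <- run_MH(c(4, 0, 10), data$x, data$y, 10000)
colMeans(chain[-(1:1000), ])                    # posterior means for (a, b, sigma) after burn-in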
This week we’ve introduced Monte Carlo (MC) methods for sampling from probability distributions, including using those samples to estimate expected values.
We discussed importance sampling, rejection sampling, and Markov Chain Monte Carlo (MCMC) methods, particularly the Metropolis and Metropolis-Hastings methods.
Finally, we applied the Metropolis-Hastings method in Bayesian analysis, generating Bayesian estimates for the parameters of a linear regression.