Monte Carlo Methods

BSMM8740-2-R-2024F [WEEK - 9]

L.L. Odette

Recap of last week

  • Last week we looked at causal effect estimation methods in greater depth, including
    • inverse probability weighting
    • regression adjustment
    • doubly robust estimation
    • matching estimation
    • Difference in Differences (DiD)
    • Panel Data and fixed effects (FE)

This week

  • We will explore Monte Carlo methods as a way to integrate difficult functions, and to sample from difficult probability distributions.
  • Along the way we will look at Markov Chains, which can both underlie sampling methods and provide a way to model the generation of data.
  • Finally, we look at Markov Chain Monte Carlo (MCMC) methods to sample from probability distributions.

Monte Carlo (MC) Methods

Monte Carlo (MC) Methods

Monte Carlo methods are a class of simulation-based methods that seek to avoid complicated and/or intractable mathematical computations.

Especially those that arise from probability distributions.

First we consider how ‘random numbers’ are computer generated.

Random Number Generation

Random numbers from a given distribution are drawn using random numbers from a (\([0,1]\)) uniform distribution, using the inverse transform sampling method as follows:

  1. Uniform Distribution: Start by drawing a random number (\(U\)) from a (\([0,1]\)) uniform distribution. This gives us a random variable that is uniformly distributed between \(0\) and \(1\).

  2. Cumulative Distribution Function (CDF): Every probability distribution has a CDF (\(F_X(x)\)), which gives the probability that a random variable (\(X\)) is less than or equal to (\(x\)).

  3. Inverse CDF: The key idea is that if (\(U\)) is uniformly distributed in (\([0,1]\)), then (\(F_X^{-1}(U)\)) (the inverse of the CDF) will give us a random sample from the distribution of (\(X\)).

Random Number Generation

Process:

  • Draw a random number (\(U\)) from the uniform (\([0,1]\)) distribution.
  • Use the inverse CDF of the desired distribution: (\(X = F_X^{-1}(U)\)).
  • The result \(X\) is a random number that follows the target distribution.

Example: For an exponential distribution with rate parameter (\(\lambda\)), the CDF is \(F_X(x) = 1 - e^{-\lambda x}\). To generate a random draw from this distribution: Draw \(U \sim \mathcal{U}(0,1)\), then use the inverse of the CDF: (\(X = -\frac{1}{\lambda} \ln(1 - U)\)), to get an exponentially distributed random variable.

This method applies broadly to any distribution where you can compute or approximate the inverse CDF.
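
A sketch of this process in R (the rate \(\lambda=2\) and the number of draws are arbitrary illustrative choices), comparing inverse-transform draws against R's built-in rexp:

inverse transform sampling sketch
set.seed(8740)
lambda <- 2                      # assumed rate parameter
M <- 10000                       # number of draws
U <- runif(M)                    # step 1: draw U ~ Uniform(0,1)
X <- -log(1 - U) / lambda        # step 2: apply the inverse CDF of the Exponential(lambda)
# both sample means should be close to 1/lambda = 0.5
c(inverse_transform = mean(X), rexp_builtin = mean(rexp(M, rate = lambda)))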

Random Number Generation

Pseudo-random generators found in computers use a deterministic (i.e. repeatable) algorithm to generate a sequence of (apparently) random numbers on the \((0, 1)\) interval.

What defines a good random number generator (RNG) is a long period – how long it takes before the sequence repeats itself. A period of \(2^{32}\) is not enough (need at least \(2^{40}\)). Various statistical tests exist to measure “randomness” – well validated software will have gone through these checks.

Random Number Generation

Alternatives to inverse transform sampling are special algorithms that generate random samples from particular distributions, e.g. one way to generate Gaussian random numbers uses the Box-Muller method:

  • generate two independent uniform \([0, 1]\) random variables, and
  • use some trigonometry.

Very important: never write your own generator, always use a well validated generator from a reputable source (e.g. R).

Random Number Generation

Recall that \(\mathscr{N}(0, 1)\) Normal random variables (mean 0, variance 1) have the probability density function:

\[ p(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\equiv\phi(x) \] and if \(X\sim\mathscr{N}(0, 1)\) then its CDF is:

\[ \mathbb{P}[X\le x] = \int_{-\infty}^x\phi(t)dt \]

Random Number Generation

The Box-Muller transformation method takes two independent uniform \((0, 1)\) random numbers \(y_1\), \(y_2\), and defines:

\[ \begin{align*} x_{1} & =\sqrt{-2\log y_{1}}\cos(2\pi y_{2})\\ x_2& =\sqrt{-2\log y_{1}}\sin(2\pi y_{2}) \end{align*} \]

It can be proved that \(x_1\) and \(x_2\) are independent \(\mathscr{N}(0, 1)\) random variables

Random Number Generation

Code
# The Box-Muller transformation
samples <- matrix(runif(10000), ncol=2) |> data.frame() |> 
  dplyr::mutate(
    normals = 
      purrr::map2(
        X1, X2
        ,(\(x1,x2){
          data.frame(
            y1 = sqrt( -2 * log(x1) ) * cos(2 * pi * x2)
            , y2 = sqrt( -2 * log(x1) ) * sin(2 * pi * x2) 
          )
        })
      )
  ) |> 
  tidyr::unnest(normals)  
  

p0 <- samples |> 
  tidyr::pivot_longer(-c(X1,X2)) |> 
  ggplot(aes(x=value, color=name, fill=name)) + 
  geom_histogram(
    aes(y=after_stat(density)), bins = 60, position="identity", alpha=0.3
  ) + 
  labs(x="Value", y="Density") + theme_minimal()

p1 <- samples |> 
ggplot(aes(x=y1, y=y2)) + geom_point() + coord_fixed() + theme_minimal()

p0+p1

Normal y1 vs Normal y2; independent random RVs

Random Number Generation

Your computer is only capable of producing pseudorandom numbers. These are made by running a pseudorandom number generator algorithm which is deterministic, e.g.

set.seed(340)
rnorm(n=10)
 [1] -0.1574 -1.1989 -0.8892  1.0091  0.6130  1.0072
 [7]  0.4144 -1.8579 -1.3487  0.5189
set.seed(340)
rnorm(n=10)
 [1] -0.1574 -1.1989 -0.8892  1.0091  0.6130  1.0072
 [7]  0.4144 -1.8579 -1.3487  0.5189

Once the RNG seed is set, the “random” numbers that R generates aren’t random at all. But someone looking at these random numbers would have a very hard time distinguishing these numbers from truly random numbers.

MC Methods

We can use MC methods to estimate probabilities: for a random variable \(Z\) with outcomes in a set \(\Omega\), given a subset \(S\subset\Omega\) and defining an event \(E\equiv Z\in S\), we can estimate the probability of \(E\) (i.e. \(\mathbb{P}(E)\)) with \(M\) samples of \(Z\), say \(z_1,z_2,\ldots,z_M\), as

\[ \mathbb{P}(E) = \frac{1}{M}\sum_{i=1}^M 1_{z_i\in S} \]

MC Methods

Example

If \(Z\sim\mathscr{N}(1,3)\) and \(S=\{z:0\le z\le 3\}\) then

\[ \mathbb{P}(E) = \mathbb{P}(0\le Z\le 3) = \int_0^3\frac{1}{\sqrt{2\pi\cdot 3}}e^{-\frac{(t-1)^2}{2\cdot 3}}dt \] In which case it is easier to just use R and calculate:

Code
pnorm(3, mean=1, sd=sqrt(3)) - pnorm(0, mean=1, sd=sqrt(3))
[1] 0.594

than it is to compute the integral.

MC Methods

Example continued

In case we don’t know that pnorm exists, we could do this:

MC computation
# define the event
event_E_happened <- function( x ) {
  if( 0 <= x & x <= 3 ) {
    return( TRUE ) # The event happened
  } else {
    return( FALSE ) # The event DIDN'T happen
  }
}
set.seed(8740)
# generate lots of copies of Z...
NMC <- 10000; # 10000 seems like "a lot".
rnorm( NMC, mean=1, sd=sqrt(3) ) |> 
  purrr::map_lgl(event_E_happened) |> 
  sum()/NMC
[1] 0.5941

MC Methods

Given event \(E\equiv Z\in S\),

\[ \mathbb{P}(E) = \frac{1}{M}\sum_{i=1}^M 1_{z_i\in S} \]

is the MC estimate of \(\mathbb{P}(E)\), which is unbiased because for each \(i\), \(\mathbb{E}[1_{z_i\in S}]=\mathbb{P}(E)\), and the variance is

\[ \begin{align*} \mathrm{Var}\left({\mathbb{P}(E)}\right) & =\frac{1}{M^{2}}\mathrm{Var}\left(\sum_{i=1}^{M}1_{z_{i}\in S}\right)\\ & =\frac{1}{M^{2}}\sum_{i=1}^{M}\mathrm{Var}\left(1_{z_{i}\in S}\right)=\frac{1}{M}\mathrm{Var}\left(1_{z_{i}\in S}\right) \end{align*} \]

MC Methods

This is true for any function of the random variable \(Z\), and

\[ \mathbb{E}\left[h(Z)\right]\approx \hat{h}=\frac{1}{M}\sum_{i=1}^M h(z_i) \]

and the variance of \(\hat{h}\) decreases as \(1/M\) by the same reasoning.
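
A quick numerical check of the \(1/M\) scaling (a sketch; the choice \(h(z)=z^2\) with \(Z\sim\mathscr{N}(0,1)\) is an arbitrary illustration): the standard error of the MC estimate should shrink by roughly \(\sqrt{10}\) each time \(M\) grows by a factor of 10.

checking the 1/M variance scaling
set.seed(8740)
h <- function(z) z^2                  # assumed test function; E[h(Z)] = 1 for Z ~ N(0,1)
mc_se <- function(M, reps = 200) {
  # standard deviation of the MC estimate across repeated experiments of size M
  sd(replicate(reps, mean(h(rnorm(M)))))
}
sapply(c(100, 1000, 10000), mc_se)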

Moments and the MGF

The \(k\)-th moment of a random variable \(X\) is the expected value of \(X\) raised to the \(k\)-th power.

\[ \acute{\mu}_k(X)=\mathbb{E}[X^k] \] which is known as the \(k\)-th raw moment.

The \(k\)-th moment of a random variable \(X\) about its mean \(\mu=\mathbb{E}[X]\) is known as the \(k\)-th central moment of \(X\). It is the \(k\)-th raw moment of \(X-\mu\).

\[ \mu_k(X)=\mathbb{E}[(X-\mu)^k] \]

The \(k\)-th standardized moment of a random variable \(X\) is the \(k\)-th central moment of \(X\) divided by the \(k\)-th power of the standard deviation of \(X\).

Moments and the MGF

The Taylor series expansion of \(e^{tX}\) is

\[ e^{tX} = \sum_{k=0}^\infty\frac{(tX)^k}{k!}=1+\frac{t}{1!}X+\frac{t^2}{2!}X^2+\cdots \]

The Moment Generating Function (MGF) is \(\mathbb{E}[e^{tX}]\)

\[ \begin{align*} M_X(t)=\mathbb{E}[e^{tX}] & =\sum_{k=0}^{\infty}\mathbb{E}\left[\frac{(tX)^{k}}{k!}\right]\\ & =1+\frac{t}{1!}\mathbb{E}\left[X\right]+\frac{t^{2}}{2!}\mathbb{E}\left[X^{2}\right]+\cdots \end{align*} \]

Moments and the MGF

Note that the derivatives evaluated at \(t=0\) give the corresponding moments

\[ \begin{align*} \left.\frac{d^{2}}{dt^{2}}M_{X}(t)\right|_{t=0} & =\mathbb{E}\left[X^{2}\right]\\ \left.\frac{d^{3}}{dt^{3}}M_{X}(t)\right|_{t=0} & =\mathbb{E}\left[X^{3}\right] \end{align*} \]

and so \(\left.\frac{d^{n}}{dt^{n}}M_{X}(t)\right|_{t=0}=\mathbb{E}\left[X^{n}\right]\)
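
For example (a worked case, not from the slides): for \(X\sim\textrm{Exponential}(\lambda)\) the MGF and its derivatives at \(t=0\) recover the familiar moments.

\[ M_X(t)=\mathbb{E}[e^{tX}]=\int_0^\infty e^{tx}\lambda e^{-\lambda x}dx=\frac{\lambda}{\lambda-t},\quad t<\lambda \]

\[ \left.\frac{d}{dt}M_X(t)\right|_{t=0}=\frac{1}{\lambda}=\mathbb{E}[X],\qquad \left.\frac{d^{2}}{dt^{2}}M_X(t)\right|_{t=0}=\frac{2}{\lambda^{2}}=\mathbb{E}[X^{2}] \]

so that \(\mathrm{Var}(X)=\mathbb{E}[X^2]-\mathbb{E}[X]^2=1/\lambda^2\).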

Moments and the MGF

Key facts needed for the CLT

The exponential function can be defined by the limit:

\[ e^x=\lim_{n\rightarrow\infty}\left(1+\frac{x}{n}\right)^n \]

A Taylor series around point \(a\) can be defined as:

\[ f(x)=\sum_{n=0}^k\frac{f^{(n)}(a)}{n!}(x-a)^n+h_k(x)(x-a)^k \]

where \(\lim_{x\rightarrow a}h_k(x)=0\)

Moments and the MGF

Other properties of the MGF

  • if \(Y=\beta_0+\beta_1X\) then

\[ M_Y(t)=\mathbb{E}[e^{t(\beta_0+\beta_1X)}]=e^{\beta_0t}M_X(\beta_1t) \]

Moments and the MGF

Other properties of the MGF

  • if \(Y=X_1+X_2+\cdots X_n\) then \(M_Y(t)=\mathbb{E}[e^{t(X_1+X_2+\cdots X_n)}]\) and:

\[ M_Y(t) = \prod_{i=1}^n M_{X_i}(t) \]

if the \(X\) are independent, and if the \(X\) are IID:

\[ M_Y(t) = \left(M_{X}(t)\right)^n \]

Moments and the MGF

Other properties of the MGF

if \(Y\sim\mathscr{N}(0,1)\) then starting with the Taylor series expansion of \(e^{t^2/2}\):

\[ \begin{align*} e^{t^{2}/2} & =1+\frac{1}{2}t^{2}+\frac{1}{2^{2}}\frac{t^{4}}{2!}+\frac{1}{2^{3}}\frac{t^{6}}{3!}+\cdots\\ \left.\frac{d^{n}}{dt^{n}}M_{Y}(t)\right|_{t=0} & =\mathbb{E}\left[Y^{n}\right]=\begin{cases} 0 & n\,\mathrm{\mathrm{odd}}\\ 2^{-n/2}\frac{n!}{(n/2)!} & n\,\mathrm{even} \end{cases}\\ M_{Y}(t) & =e^{t^{2}/2} \end{align*} \]

Central Limit Theorem

Let \(X_1, X_2,\ldots, X_n\) be IID with mean \(\mu\) and variance \(\sigma^2\). Define the sample mean \(\bar{X}_n=n^{-1}\sum_{i=1}^nX_i\) and the standardized mean \(\bar{Z}_n=(\bar{X}_n-\mu)/(\sigma/\sqrt{n})\)

\[ \begin{align*} \bar{Z}_{n} & =\frac{\bar{X}_{n}-\mu}{\sigma/\sqrt{n}}=\frac{\sqrt{n}}{\sigma}\left[\frac{X_{1}+X_{2}+\cdots X_{n}}{n}-\mu\right]\\ & =\frac{\sqrt{n}}{\sigma}\left[\frac{X_{1}+X_{2}+\cdots X_{n}-n\mu}{n}\right]\\ & =\sqrt{n}\left[\frac{\frac{X_{1}-\mu}{\sigma}+\frac{X_{2}-\mu}{\sigma}+\cdots\frac{X_{n}-\mu}{\sigma}}{n}\right]=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i} \end{align*} \]

Central Limit Theorem

\[ \begin{align*} M_{\bar{Z}_{n}}(t) & =\left(M_{Z/\sqrt{n}}(t)\right)^{n}=\left(M_{Z}(t/\sqrt{n})\right)^{n}\\ M_{Z}(t/\sqrt{n}) & =1+0+\frac{\left(t/\sqrt{n}\right)^{2}}{2!}+\sum_{k=3}^{\infty}M_{Z}^{(k)}(0)\frac{\left(t/\sqrt{n}\right)^{k}}{k!} \end{align*} \] Now \(M_{Z}(t/\sqrt{n}) = T_2(t/\sqrt{n}) + R_2(t/\sqrt{n})\) where \(T_2\) is the Taylor series up to order 2, and \(R_2\) is the residual - the Taylor series terms greater than order 2.

Central Limit Theorem

\[ \lim_{n\rightarrow\infty}\frac{R_2(t/\sqrt{n})}{(t/\sqrt{n})^2}=\lim_{n\rightarrow\infty}nR_2(t/\sqrt{n})=0 \] and

\[ \begin{align*} \lim_{n\rightarrow\infty}M_{\bar{Z}_{n}}(t) & =\lim_{n\rightarrow\infty}\left(1+0+\frac{\left(t/\sqrt{n}\right)^{2}}{2!}+R_{2}(t/\sqrt{n})\right)^{n}\\ & =\lim_{n\rightarrow\infty}\left(1+\frac{1}{n}\left(\frac{t^{2}}{2}+nR_{2}(t/\sqrt{n})\right)\right)^{n}=\lim_{n\rightarrow\infty}\left(1+\frac{t^{2}/2}{n}\right)^{n} \end{align*} \]

Central Limit Theorem

and since

\[ e^x=\lim_{n\rightarrow\infty}\left(1+\frac{x}{n}\right)^n \]

we have

\[ \lim_{n\rightarrow\infty}M_{\bar{Z}_{n}}(t)=\lim_{n\rightarrow\infty}\left(1+\frac{t^{2}/2}{n}\right)^{n}=e^{t^2/2} \]

i.e. the standardized mean of \(n\) IID samples of a random variable converges to a standard Gaussian in the limit \(n\rightarrow\infty\).
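
A quick numerical illustration (a sketch; the Exponential(1) population, \(n=1000\), and 5000 replications are arbitrary choices): the standardized means of a decidedly non-Gaussian random variable look close to \(\mathscr{N}(0,1)\).

CLT check with exponential samples
set.seed(8740)
n <- 1000; reps <- 5000
# Exponential(1) has mu = 1 and sigma = 1, so the standardized mean is sqrt(n) * (xbar - 1)
z_bar <- replicate(reps, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))
c(mean = mean(z_bar), sd = sd(z_bar))   # should be close to 0 and 1 respectively
# hist(z_bar, breaks = 50, freq = FALSE); curve(dnorm(x), add = TRUE, col = "red")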

Central Limit Theorem

The CLT says that for a quantity \(f\) being estimated (e.g. \(f=1_{Z\in S}\), so that \(\mathbb{E}[f]=\mathbb{P}(E)\)), if \(\sigma^2\equiv\mathrm{Var}\left(f\right)\) is finite then the error of the MC estimate

\[ e_M(f)=\bar{f}-\mathbb{E}[f] \]

is approximately Normal in distribution for large \(M\), i.e.

\[ e_M(f)\approx\sigma M^{-1/2}Z \] where \(Z\sim\mathscr{N}(0,1)\)

MC Methods

Suppose we need to compute an expectation \(\mathbb{E}[h(Z)]\) for some random variable \(Z\) and some function \(h:\mathbb{R}\to\mathbb{R}\).

Monte Carlo methods avoid doing any integration and instead just generate many samples of \(Z\), say \(z_1,z_2,\ldots,z_M\) and estimate \(\mathbb{E}[h(Z)]\) as \(\frac{1}{M}\sum_{i=1}^Mh(z_i)\), relying on empirical probability. The law of large numbers states that this sample mean should be close to \(\mathbb{E}[h(Z)]\).

Monte Carlo replaces the work of computing an integral (i.e., an expectation) with the work of generating many random variables.

MC Methods

sampling

If \(X\sim\mathscr{N}(\mu,\sigma)\) and we want to compute \(\mathbb{E}[\log|X|]\), we could set up and solve the integral

\[ \begin{align*} \mathbb{E}\left[\log|X|\right] & =\int_{-\infty}^{\infty}\left(\log|t|\right)f(t;\mu,\sigma)dt\\ & =\int_{-\infty}^{\infty}\frac{\log|t|}{\sqrt{2\pi\sigma^{2}}}\exp\left\{ \frac{-(t-\mu)^{2}}{2\sigma^{2}}\right\} dt \end{align*} \]

Alternatively, we could just draw lots of Monte Carlo replicates \(X_1,X_2,\cdots,X_M\) from a normal with mean \(\mu\) and variance \(\sigma^2\), and look at the sample mean \(M^{-1}\sum_{i=1}^M\log|x_i|\), once again appealing to the law of large numbers to ensure that this sample mean is close to its expectation.
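
A sketch of that Monte Carlo calculation (the values \(\mu=1\), \(\sigma=2\), and \(M=10^5\) are arbitrary illustrative choices):

MC estimate of E[log|X|]
set.seed(8740)
mu <- 1; sigma <- 2; M <- 1e5           # assumed illustrative values
x <- rnorm(M, mean = mu, sd = sigma)    # Monte Carlo replicates of X
h <- log(abs(x))                        # h(X) = log|X|
c(estimate = mean(h), std_error = sd(h) / sqrt(M))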

MC Methods

sampling

Suppose that we want to compute an integral

\[ \int_Dg_0(x)dx \] where \(D\) is some domain of integration and \(g_0(.)\) is a function.

Let \(f(x)\) be the density of some random variable with \(f(x)>0, \forall x\in D\). Then we can rewrite the integral as

\[ \int_Dg_0(x)dx = \int_D\frac{g_0(x)}{f(x)}f(x)dx = \mathbb{E}_f[h(x)] \]

where \(h(x)=g_0(x)/f(x)\) and \(X\sim f\)

MC Methods

Similarly, suppose we are given \(h(x)\) and we want to compute \(\mathbb{E}_f[h(X)]\) where \(X\sim f,\,x\in D\), but we cannot sample from \(f\) directly.

If there were some other distribution \(g_1(x)\) we could sample from, such that \(g_1(x)>0,\,x\in D\), then

\[ \begin{align*} \mathbb{E}_{f}\left[h(x)\right] & =\int_{D}h(x)f(x)dx\\ & =\int_{D}h(x)\frac{f(x)}{g_1(x)}g_1(x)dx=\mathbb{E}_{g_1}\left[h(x)\frac{f(x)}{g_1(x)}\right]\\ & \approx\frac{1}{n}\sum_{i=1}^{n}h(x_{i})\frac{f(x_{i})}{g_1(x_{i})}\quad x_{i}\sim g_1 \end{align*} \]

MC Methods

This method is called importance sampling (IS).

  1. draw iid \(x_1,x_2,\ldots,x_n\) from \(g_1\) and calculate the importance weight

\[ w(x_i)=\frac{f(x_{i})}{g_1(x_{i})} \]

  2. estimate \(\mathbb{E}_f\left[h(X)\right]\) by

\[ \hat{\mu}_h=\frac{1}{n}\sum_{i=1}^nw(x_i)h(x_i) \]

MC Methods

example

Estimate \(\mathbb E_f(X)\) where \(f(x) = \sqrt{2/\pi}e^{-\frac{x^2}{2}};\;x\ge 0\) (this is the half-Normal distribution)

integrate the half-normal
n <- 5000
X <- rexp(n, rate=2)
# calculate the importance weights
W <- exp(-0.5 * X^2 + 2*X) / sqrt(2 * pi)

mu_h  <- mean(W*X)
var_h <- var(W*X)/n
se_h  <- sqrt(var_h)

tibble::tibble(mean = mu_h,  variance = var_h, 'standard error' = se_h) |> 
  gt::gt() |> 
  gt::fmt_number(decimals=4) |> 
  gt::tab_options( table.font.size = gt::px(30) ) |>  
  gt::as_raw_html()
mean variance standard error
0.8208 0.0003 0.0186

MC Methods

repeating the prior example, but un-normalized

Estimate \(\mathbb E_f(X)\) where \(f(x) = e^{-\frac{x^2}{2}};\;x\ge 0\) (this is the half-Normal distribution, un-normalized)

integrate the (un-normalized) half-normal
# un-normalized weights
n <- 5000
X <- rexp(n, rate=2)
# calculate the weights
W <- exp(-0.5 * X^2 + 2*X)

mu_h2  <- sum(W*X)/sum(W)
var_h2 <- var(W/mean(W))
se_h2  <- sqrt(var_h2)

tibble::tibble(mean = mu_h2,  variance = var_h2, 'standard error' = se_h2) |> 
  gt::gt() |> 
  gt::fmt_number(decimals=4) |> 
  gt::tab_options( table.font.size = gt::px(30) ) |>  
  #gt::tab_header(title = )
  gt::as_raw_html()
mean variance standard error
0.7994 0.4183 0.6468

MC Methods

rejection sampling

  • Assume we have an un-normalized \(g(x)\), i.e. \(\pi(x)=cg(x)\) but \(c\) is unknown.
  • We want to generate iid samples \(x_1,x_2,\ldots,x_M\sim \pi\) to estimate \(\mathbb{E}_\pi[h(X)]\).
  • Now assume we have an easily sampled density \(f\), and known \(K>0\), such that \(Kf(x)\ge g(x),\;\forall x\), i.e. \(Kf(x)\ge \pi(x)/c\) (or \(cKf(x)\ge \pi(x)\)).

MC Methods

rejection sampling

Then use the following procedure:

  1. sample \(X\sim f\) and \(U\sim \mathrm{uniform}[0,1]\)
  2. if \(U\le\frac{g(X)}{Kf(X)}\), then accept \(X\) as a draw from \(\pi\)
  3. otherwise reject the sample and repeat
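
A sketch of this procedure in R (illustrative choices, not from the slides: un-normalized target \(g(x)=x(1-x)\) on \([0,1]\), so \(\pi\) is the Beta(2,2) density with \(c=6\); the proposal \(f\) is Uniform\([0,1]\) and \(K=1/4\)):

rejection sampling sketch
set.seed(8740)
g <- function(x) x * (1 - x)          # un-normalized target: pi(x) = 6 g(x) is Beta(2,2)
K <- 1/4                              # K f(x) >= g(x) for the Uniform(0,1) proposal, f(x) = 1

M <- 10000
X <- runif(M)                         # step 1: sample X ~ f and U ~ Uniform(0,1)
U <- runif(M)
accepted <- X[U <= g(X) / K]          # step 2: accept when U <= g(X) / (K f(X))

length(accepted) / M                              # acceptance rate, approximately 1/(cK) = 2/3
c(mean = mean(accepted), var = var(accepted))     # Beta(2,2) has mean 1/2 and variance 1/20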

MC Methods

rejection sampling

  • Since \(0\le\frac{g(x)}{Kf(x)}\le 1\) we know that \(\mathbb{P}\left(U\le\frac{g(X)}{Kf(X)}|X=x\right)=\frac{g(x)}{Kf(x)}\)

  • and so

\[ \begin{align*} \mathbb{E}_{f}\left[\mathbb{P}\left(U\le\frac{g(X)}{Kf(X)}|X=x\right)\right] & =\mathbb{E}_{f}\left[\frac{g(X)}{Kf(X)}\right]\\ & =\int_{-\infty}^{\infty}\frac{g(x)}{Kf(x)}f(x)dx\\ & =\int_{-\infty}^{\infty}\frac{g(x)}{K}dx \end{align*} \]

MC Methods

rejection sampling

Similarly, for any \(y\in\mathbb{R}\), we can calculate the joint probability

\[ \begin{align*} \mathbb{P}\left(X\le y,U\le\frac{g(X)}{Kf(X)}\right) & =\mathbb{E}_f\left[1_{X\le y}1_{U\le\frac{g(X)}{Kf(X)}}\right]\\ & =\mathbb{E}_f\left[1_{X\le y}\mathbb{P}\left(U\le\frac{g(X)}{Kf(X)}|X=x\right)\right]\\ & =\mathbb{E}_f\left[1_{X\le y}\frac{g(X)}{Kf(X)}\right]=\int_{-\infty}^{y}\frac{g(x)}{Kf(x)}f(x)dx\\ & =\int_{-\infty}^{y}\frac{g(x)}{K}dx \end{align*} \]

  • and so we have the joint probability (above - \(\mathbb{P}(A,B)\)), and the probability of acceptance (previous slide - \(\mathbb{P}(B)\)), so the probability, conditional on acceptance (\(\mathbb{P}(A|B)\)) is \(\frac{\mathbb{P}(A,B)}{\mathbb{P}(B)}\) by Bayes rule.

MC Methods

rejection sampling

Substituting for the numerator and denominator:

\[ \mathbb{P}\left(X\le y|U\le\frac{g(X)}{Kf(X)}\right)=\frac{\int_{-\infty}^{y}\frac{g(x)}{K}dx}{\int_{-\infty}^\infty\frac{g(x)}{K}dx}=\int_{-\infty}^{y}\pi(x)dx \]

MC Methods

rejection sampling: proof for discrete rv

We have \(\mathbb{P}(X=x) = \sum_{n=1}^\infty\mathbb{P}(\mathrm{reject }\,Y)^{n-1}\mathbb{P}(\mathrm{draw }\,Y=x\,\mathrm{and\, accept})\)

We also have

\[ \begin{align*} & \mathbb{P}(\mathrm{draw}\,Y=x\,\mathrm{and\,accept})\\ = & \mathbb{P}(\mathrm{draw}\,Y=x)\mathbb{P}(\left.\mathrm{accept}\,Y\right|Y=x)\\ = & f(x)\mathbb{P}\left(\left.U\le\frac{g(Y)}{Kf(Y)}\right|Y=x\right)\\ = & \frac{g(x)}{K} = \frac{\pi(x)}{cK} \end{align*} \]

MC Methods

rejection sampling: proof for discrete rv

The probability of rejection of a draw is

\[ \begin{align*} \mathbb{P}(\mathrm{{reject}}\,Y) & =\sum_{x\in D}\mathbb{P}(\mathrm{{draw}}\,Y=x\,\mathrm{and\,reject\,it})\\ & =\sum_{x\in D}f(x)\mathbb{P}\left(\left.U\ge\frac{g(Y)}{Kf(Y)}\right|Y=x\right)\\ & =\sum_{x\in D}f(x)\left(1-\frac{g(x)}{Kf(x)}\right)=1-\frac{1}{cK} \end{align*} \]

MC Methods

rejection sampling: proof for discrete rv

and so

\[ \begin{align*} \mathbb{P}(X=x) & =\sum_{n=1}^{\infty}\mathbb{P}(\mathrm{reject}\,Y)^{n-1}\mathbb{P}(\mathrm{draw}\,Y=x\,\mathrm{and\,accept})\\ & =\sum_{n=1}^{\infty}\left(1-\frac{1}{cK}\right)^{n-1}\frac{\pi(x)}{cK}=\pi(x) \end{align*} \]

Stochastic processes

Stochastic processes

A stochastic process, which we will usually write as \((X_n)\), is an indexed sequence of random variables that are possibly dependent on each other.

Each random variable \(X_n\) takes a value in a state space \(\mathcal S\) which is the set of possible values for the process. As with usual random variables, the state space \(\mathcal S\) can be discrete or continuous.

A discrete state space denotes a set of distinct possible outcomes, which can be finite or countably infinite. For example, \(\mathcal S = \{\text{Heads},\text{Tails}\}\) is the state space for a single coin flip, while in the case of counting insurance claims, the state space would be the nonnegative integers \(\mathcal S = \mathbb Z_+ = \{0,1,2,\dots\}\).

Stochastic processes

The stochastic process has an index set that orders the random variables that make up the process.

The index set is usually interpreted as a time variable, telling us when the process will be measured. The index set for time can also be discrete or continuous. Discrete time denotes a process sampled at distinct points, often denoted by \(n = 0,1,2,\dots\), while continuous time denotes a process monitored constantly over time, often denoted by \(t \in \mathbb R_+ = \{x \in \mathbb R : x \geq 0\}\).

Markov property

Consider a simple board game where we roll a die and move that many squares forward on the board. Suppose we are currently on square \(X_n\). What can we say about which square \(X_{n+1}\) we move to on our next turn?

  • \(X_{n+1}\) is random, since it depends on the roll of the dice.
  • \(X_{n+1}\) depends on where we are now, \(X_n\), since the die's score will be added onto the number of our current square,
  • Given the square \(X_n\) we are on now, \(X_{n+1}\) doesn’t depend on which sequence of squares \(X_0, X_1, \dots, X_{n-1}\) we used to get here.

Markov property

The third point is called the Markov property or memoryless property.

The key property of Markov chains is memorylessness: to get the next value we only need to know where we are now, not which squares we visited to get here.

The past and the future are conditionally independent given the present.

Markov Chains

Consider the following simple random walk on the integers \(\mathbb Z\):

We start at \(0\), then at each time step, we go up by one with probability \(p\) and down by one with probability \(q = 1-p\). When \(p = q = \frac12\), we’re equally as likely to go up as down, and we call this the simple symmetric random walk.

The simple random walk is a simple but very useful model for lots of processes, like stock prices, sizes of populations, or positions of gas particles. (In many modern models, however, these have been replaced by more complicated continuous time and space models.) The simple random walk is sometimes called the “drunkard’s walk.”

Markov Chains

random walks
require(ggplot2, quietly = TRUE)
set.seed(315)

rrw <- function(n, p = 1/2) {
  q <- 1 - p
  Z <- sample(c(1, -1), n, replace = TRUE, prob = c(p, q))
  c(0, cumsum(Z))
}

n <- 2000
rw_dat <- tibble::tibble(x=0:n) |> 
  dplyr::mutate(
    "p = 2/3" = rrw(n, 2/3)
    , "p = 1/3" = rrw(n, 1/3)
    , "p = 1/2" = rrw(n, 1/2)
  )

p0 <- rw_dat |> dplyr::slice_head(n=20) |> 
  tidyr::pivot_longer(cols = -x) |> 
  ggplot(aes(x=x,y=value, color=name)) + 
  geom_line() + 
  theme_minimal()

p1 <- rw_dat |> dplyr::slice_head(n=200) |> 
  tidyr::pivot_longer(cols = -x) |> 
  ggplot(aes(x=x,y=value, color=name)) + 
  geom_line() + 
  theme_minimal()

p3 <- rw_dat |> #dplyr::slice_head(n=200) |> 
  tidyr::pivot_longer(cols = -x) |> 
  ggplot(aes(x=x,y=value, color=name)) + 
  geom_line() + 
  theme_minimal()

p0+p1+p3

Markov Chains

We can write this as a stochastic process \((X_n)\) with discrete time \(n = \{0,1,2,\dots\} = \mathbb Z_+\) and discrete state space \(\mathcal S = \mathbb Z\), where \(X_0 = 0\) and, for \(n \geq 0\), we have

\[ X_{n+1} = \begin{cases} X_n + 1 & \text{with probability $p$,} \\ X_n - 1 & \text{with probability $1-p$} \end{cases} \]

It’s clear from this definition that \(X_{n+1}\) (the future) depends on \(X_n\) (the present), but, given \(X_n\), does not depend on \(X_{n-1}, \dots, X_1, X_0\) (the past). Thus the Markov property holds, and the simple random walk is a discrete time Markov process or Markov chain.

Markov Chains

To define a Markov chain, we first need to say where we start from, and then what the probabilities of transitions from one state to another are.

In our examples of the simple random walk and gambler’s ruin, we specified the start point \(X_0 = i\) exactly, but we could also pick the start point at random according to some distribution \(\lambda_i = \mathbb P(X_0 = i)\).

Markov Chains

We also need to know the transition probabilities \(\mathbb P(X_{n+1} = j \mid X_n = i)\) for \(i,j \in \mathcal S\). Because of the Markov property, the transition probability only needs to condition on the state we’re in now \(X_n = i\), and not on the whole history of the process.

In the case of the simple random walk, for example, we had initial distribution

\[ \lambda_i = \mathbb P(X_0 = i) = \begin{cases} 1 & \text{if $i = 0$} \\ 0 & \text{otherwise} \end{cases} \]

and transition probabilities

\[ \mathbb P(X_{n+1} = j \mid X_n = i) = \begin{cases} p & \text{if $j = i+1$} \\ q & \text{if $j = i-1$} \\ 0 & \text{otherwise.} \end{cases} \]

Markov Chains

For the random walk (and also the gambler’s ruin), the transition probabilities \(\mathbb P(X_{n+1} = j \mid X_n = i)\) don’t depend on \(n\); in other words, the transition probabilities stay the same over time.

A Markov process with this property is called time homogeneous. We will only consider time homogeneous processes in this course.

Markov Chains

Write \(p_{ij} = \mathbb P(X_{n+1} = j \mid X_n = i)\) for the transition probabilities, which are independent of \(n\). We must have \(p_{ij} \geq 0\), since it is a probability, and we must also have \(\sum_j p_{ij} = 1\) for all states \(i\), as this is the sum of the probabilities of all the places you can move to from state i.

When the state space is finite (and even sometimes when it’s not), it’s convenient to write the transition probabilities \((p_{ij})\) as a matrix \(\mathsf P\), called the transition matrix, whose \((i,j)\)th entry is \(p_{ij}\).

Then the condition that \(\sum_j p_{ij} = 1\) is the condition that each of the rows of \(\mathsf P\) add up to \(1\).

Markov Chains

Consider a simple two-state Markov chain with state space \(\mathcal S = \{0,1\}\) and transition matrix \[ \mathsf P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix} \] for some \(0 < \alpha<1,\; 0<\beta < 1\). Note that the rows of \(\mathsf P\) add up to \(1\), as they must.


We can illustrate \(\mathsf P\) by a transition diagram/graph, where the nodes/vertices are the states and the arrows give the transition probabilities. (We don’t draw the arrow if \(p_{ij} = 0\).) In this case, our transition diagram looks like the graph on the next slide:

Markov Chains

Transition diagram for the two-state Markov chain

Markov Chains

We can use this graph as a simple model of customer churn, for example. If the customer has closed their account (state 0) in one period, then with probability \(\alpha\) we will be able to entice them to open their account again (state 1) by the next period; while if the customer has an account (state 1), then with probability \(\beta\) they will have closed their account (state 0) by the next period.

Markov Chains

If the customer has an account in period \(n\), what’s the probability they also have an account in period \(n+2\)?

\[ p_{11}(2) = \mathbb P (X_{n+2} = 1 \mid X_n = 1) \]

The key to calculating this is to condition on the first step – that is, on whether the customer has an account in period \(n+1\). We have

\[ \begin{align*} p_{11}(2) &= \mathbb P (X_{n+1} = 0 \mid X_n = 1)\,\mathbb P (X_{n+2} = 1 \mid X_{n+1} = 0, X_n = 1) \\ &\qquad{} + \mathbb P (X_{n+1} = 1 \mid X_n = 1)\,\mathbb P (X_{n+2} = 1 \mid X_{n+1} = 1, X_n = 1) \\ &= \mathbb P (X_{n+1} = 0 \mid X_n = 1)\,\mathbb P (X_{n+2} = 1 \mid X_{n+1} = 0) \\ &\qquad{} + \mathbb P (X_{n+1} = 1 \mid X_n = 1)\,\mathbb P (X_{n+2} = 1 \mid X_{n+1} = 1) \\ &= p_{10}p_{01} + p_{11}p_{11} \\ &= \beta\alpha + (1-\beta)^2 \end{align*} \]

Markov Chains

In the second equality, we used the Markov property, which means that conditional probabilities like \(\mathbb P(X_{n+2} = 1 \mid X_{n+1} = k)\) do not depend on \(X_n\).

Another way to think of this is that we are summing the probabilities of all length-2 paths from 1 to 1, which are \(1\to 0\to 1\) with probability \(\beta\alpha\) and \(1 \to 1 \to 1\) with probability \((1-\beta)^2\).

Markov Chains

In the above example, we calculated a two-step transition probability:

\[ p_{ij}(2) = \mathbb P (X_{n+2} = j \mid X_n = i) \]

by conditioning on the first step. That is, by considering all the possible intermediate steps \(k\), we have

\[ \begin{align*} p_{ij}(2) & =\sum_{k\in\mathcal{S}}\mathbb{P}(X_{n+1}=k\mid X_{n}=i)\mathbb{P}(X_{n+2}=j\mid X_{n+1}=k)\\ & =\sum_{k\in\mathcal{S}}p_{ik}p_{kj} \end{align*} \]

But this is exactly the formula for multiplying the matrix \(\mathsf P\) with itself! In other words, \(p_{ij}(2) = \sum_{k} p_{ik}p_{kj}\) is the \((i,j)\)th entry of the matrix square \(\mathsf P^2 = \mathsf{P}\times\mathsf{P}\). If we write \(\mathsf P(2) = (p_{ij}(2))\) for the matrix of two-step transition probabilities, we have \(\mathsf P(2) = \mathsf P^2\).

More generally, we see that this rule holds over multiple steps, provided we sum over all the possible paths \(i\to k_1 \to k_2 \to \cdots \to k_{n-1} \to j\) of length \(n\) from \(i\) to \(j\).
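
As a numerical check (a sketch using illustrative values \(\alpha=0.3\), \(\beta=0.2\) for the two-state chain above; these numbers are assumptions, not from the slides):

two-step transition probabilities as a matrix power
alpha <- 0.3; beta <- 0.2                      # assumed illustrative values
P <- matrix(c(1 - alpha, alpha,
              beta,      1 - beta), nrow = 2, byrow = TRUE)
P2 <- P %*% P                                  # matrix of two-step transition probabilities
P2[2, 2]                                       # p_11(2), with rows/columns ordered (state 0, state 1)
beta * alpha + (1 - beta)^2                    # matches the hand calculation above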

Markov Chains

Theorem 1 Let \((X_n)\) be a Markov chain with state space \(\mathcal S\) and transition matrix \(\mathsf P = (p_{ij})\). For \(i,j \in \mathcal S\), write \[ p_{ij}(n) = \mathbb P(X_n = j \mid X_0 = i) \] for the \(n\)-step transition probability. Then

\[ p_{ij}(n) = \sum_{k_1, k_2, \dots, k_{n-1} \in \mathcal S} p_{ik_1} p_{k_1k_2} \cdots p_{k_{n-2}k_{n-1}} p_{k_{n-1}j} \]

In particular, \(p_{ij}(n)\) is the \((i,j)\)th element of the matrix power \(\mathsf P^n\), and the matrix of \(n\)-step transition probabilities is given by \(\mathsf P(n) = \mathsf P^n\).

Markov Chains

The so-called Chapman–Kolmogorov equations follow immediately from this.

Let \((X_n)\) be a Markov chain with state space \(\mathcal S\) and transition matrix \(\mathsf P = (p_{ij})\). Then, for non-negative integers \(n,m\), we have \[ p_{ij}(n+m) = \sum_{k \in \mathcal S} p_{ik}(n)p_{kj}(m) , \] or, in matrix notation, \(\mathsf P(n+m) = \mathsf P(n)\mathsf P(m)\).

Markov Chains

In other words, a trip of length \(n + m\) from \(i\) to \(j\) is a trip of length \(n\) from \(i\) to some other state \(k\), then a trip of length \(m\) from \(k\) back to \(j\), and this intermediate stop \(k\) can be any state, so we have to sum the probabilities.

Of course, once we know that \(\mathsf P(n) = \mathsf P^n\) is given by the matrix power, it’s clear to see that \(\mathsf P(n+m) = \mathsf P^{n+m} = \mathsf P^n \mathsf P^m = \mathsf P(n)\mathsf P(m)\).

Markov Chains

initial distribution

If we start from a state given by a pmf \(\boldsymbol \pi_0 = (\pi_{0,i})_{i\in S}\), then after step 1 the probability we’re in state \(j\) is \(\sum_i \pi_{0,i} p_{ij}\).

Given transition matrix \(\mathsf P\) and initial state pmf \(\boldsymbol \pi_0\), the pmf for the state after one step is \(\boldsymbol \pi_1=\boldsymbol \pi_0 \times\mathsf P\), i.e. \(\boldsymbol \pi_1\) and \(\boldsymbol \pi_0\) are row vectors, and for any given state pmf, the pmf for the states on the next step of the Markov chain is the current state pmf multiplied by the transition matrix.
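
In R this is just a row-vector times matrix product (a sketch with an assumed two-state transition matrix and a chain started in state 0):

propagating the state pmf one step
P <- matrix(c(0.7, 0.3,
              0.2, 0.8), nrow = 2, byrow = TRUE)   # assumed transition matrix
pi0 <- c(1, 0)                                     # start in state 0 with probability 1
pi1 <- pi0 %*% P                                   # state pmf after one step
pi2 <- pi1 %*% P                                   # state pmf after two steps
rbind(pi0, pi1, pi2)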

Markov Chains

Starting from a state given by a distribution \(\boldsymbol \pi = (\pi_i)_{i\in S}\), then after step 1 the probability we’re in state \(j\) is \(\sum_i \pi_i p_{ij}\). So if \(\pi_j = \sum_i \pi_i p_{ij}\), we stay in this distribution forever. We call such a distribution a stationary distribution. We again recognise this formula as a matrix-vector multiplication, so this is \(\boldsymbol \pi = \boldsymbol \pi\mathsf P\), where \(\boldsymbol \pi\) is a row vector.

Let \((X_n)\) be a Markov chain on a state space \(\mathcal S\) with transition matrix \(\mathsf P\). Let \(\boldsymbol \pi = (\pi_i)\) be a distribution on \(\mathcal S\), in that \(\pi_i \geq 0\) for all \(i \in \mathcal S\) and \(\sum_{i \in \mathcal S} \pi_i = 1\). We call \(\boldsymbol \pi\) a stationary distribution if

\[ \pi_j = \sum_{i\in \mathcal S} \pi_i p_{ij} \quad \text{for all $j \in \mathcal S$} \]

or, equivalently, if \(\boldsymbol \pi = \boldsymbol \pi\mathsf P\).

Markov Chains

Note that we’re saying the distribution \(\mathbb P(X_n = i)\) stays the same; the Markov chain \((X_n)\) itself will keep moving.

One way to think is that if we started off a thousand Markov chains, choosing each starting position to be \(i\) with probability \(\pi_i\), then (roughly) \(1000 \pi_j\) of them would be in state \(j\) at any time in the future – but not necessarily the same ones each time.

Markov Chains

stationary distributions

So our definition of the stationary state pmf is: \(\pi\) such that \(\pi=\pi\times \mathsf P\). We can calculate it as follows:

\[ \begin{align*} \pi-\pi\mathsf{P} & =0\\ \pi\left(I-\mathsf{P}\right) & =0\\ \left(I-\mathsf{P}\right)^{\top}\pi^{\top} & =0 \end{align*} \]

along with the additional constraint \(\sum_i\pi_i=1\).

If we add a row of ones at the bottom of our matrix \(\left(I-\mathsf{P}\right)^{\top}\) and a \(1\) at the bottom of the vector on the right we can solve for \(\pi\), e.g. using qr.solve.

Markov Chains

stationary distribution example

Code
P = matrix(
  c(
    0.2, 0.3, 0.5, 0,
    0, 0.1, 0.1, 0.8,
    0.5, 0.2, 0, 0.3,
    0.3,0.1,0.3,0.3
  )
  , nrow = 4, byrow = TRUE
)
P
     [,1] [,2] [,3] [,4]
[1,]  0.2  0.3  0.5  0.0
[2,]  0.0  0.1  0.1  0.8
[3,]  0.5  0.2  0.0  0.3
[4,]  0.3  0.1  0.3  0.3
Code
A <- t(diag(1,4) - P)
A
     [,1] [,2] [,3] [,4]
[1,]  0.8  0.0 -0.5 -0.3
[2,] -0.3  0.9 -0.2 -0.1
[3,] -0.5 -0.1  1.0 -0.3
[4,]  0.0 -0.8 -0.3  0.7
Code
pi <- 
  qr.solve(
    A |> rbind(c(1,1,1,1))
    , c(0,0,0,0,1)
    , tol = 1e-10
  )
pi
[1] 0.2686 0.1782 0.2447 0.3085
Code
pi %*% P
       [,1]   [,2]   [,3]   [,4]
[1,] 0.2686 0.1782 0.2447 0.3085

Markov Chains

properties

  • A Markov Chain is irreducible if you have positive probability of eventually getting to anywhere in the state space from anywhere else in the state space

  • A Markov Chain is aperiodic if there are no forced cycles, i.e. there do not exist disjoint non-empty subsets \(X_1,X_2,...,X_d\) for \(d≥2\), such that \(P(x,X_{i+1})=1\) for all \(x\in X_i \;(1≤i≤d−1)\), and \(P(x,X_1)=1\) for all \(x\in X_d\).

Markov Chains

MC with cycle

Markov Chain Monte Carlo

Markov Chain Monte Carlo

Suppose we have a complicated, high-dimensional density \(\pi = cg\) and we want samples \(X_1, X_2,\dots \sim \pi\).

Define a Markov chain \(X_0, X_1,X_2\dots\) in such a way that for large enough \(m\), we have \(X_n\sim \pi,\,\forall n\ge m\).

Then we can estimate \(\mathbb{E}_{\pi}(h) ≡ \int h(x) \pi(x) dx\) by:

\[ \mathbb{E}_{\pi}[h] \approx \frac{1}{M-B}\sum_{i=B+1}^{M}h(x_i) \]

where \(B\) (“burn-in”) is chosen large enough so \(X_B\sim\pi\), and \(M\) is chosen large enough to get good Monte Carlo estimates.

MCMC Metropolis Algo

  1. choose some initial value \(X_0\), then
  2. given \(X_{n-1}\), choose a proposal state \(Y_n\sim \textrm{MVN}(X_{n-1},\sigma^2\textrm{I})\), for some fixed \(\sigma^2>0\)
  3. let \(A_n=\pi(Y_n)/\pi(X_{n-1})=g(Y_n)/g(X_{n-1})\) and \(U_n\sim\textrm{U}[0,1]\), then
  4. if \(U_n<A_n\) set \(X_n=Y_n\) (“accept”), otherwise set \(X_n = X_{n-1}\) (“reject”)
  5. repeat for \(n=1,2,3,\ldots,M\)

This version is called random-walk Metropolis - if we always accepted the proposals, they would form a traditional random walk process.

MCMC Metropolis Algo

  • the burn-in period is a matter of trial and error
  • the start values don’t matter much, but central ones are ‘better’
  • if \(\sigma\) is too small then we usually accept and the chain doesn’t move much
  • if \(\sigma\) is too big then we usually reject and the chain doesn’t move much
  • generally the acceptance rate should be far from both zero and 1

MCMC Metropolis Algo

The variance of the estimate is

\[ \frac{1}{M-B}\textrm{Var}_{\pi}(h)\times \textrm{varfact} \]

where

\[ \textrm{varfact} = 1+2\sum_{k=1}^{\infty}\textrm{Corr}_{\pi} \left(h(X_0),h(X_k)\right)=\sum_{-\infty}^{\infty}\rho_k \]

MCMC Metropolis Algo

a simple Metropolis algorithm in one dimension (continuous RV)
g = function(y) {
    if ( (y<0) || (y>1) )
    return(0)
    else
    return( y^3 * sin(y^4) * cos(y^5) )
}

h = function(y) { return(y^2) }

M = 11000  # run length
B = 1000  # amount of burn-in
X = runif(1)  # overdispersed starting distribution
sigma = 1  # proposal scaling
xlist = rep(0,M)  # for keeping track of chain values
hlist = rep(0,M)  # for keeping track of h function values
numaccept = 0;

for (i in 1:M) {
  Y = X + sigma * rnorm(1)  # proposal value
  U = runif(1)              # for accept/reject
  alpha = g(Y) / g(X)       # for accept/reject
  if (U < alpha) {
    X = Y                   # accept proposal
    numaccept = numaccept + 1;
  }
    xlist[i] = X;
    hlist[i] = h(X);
}

u = mean(hlist[(B+1):M])
se1 =  sd(hlist[(B+1):M]) / sqrt(M-B)
varfact <- function(xxx) { 2 * sum(acf(xxx, plot=FALSE)$acf) - 1 }
thevarfact = varfact(hlist[(B+1):M])
se = se1 * sqrt( thevarfact )
Code
tibble::tibble(
  measure = c("iterations","burn-in","mean of h","iid se","varfact","true se","95% CI lb","95% CI ub")
  , value = c(M,B,u,se1,thevarfact,se,u-1.96*se,u+1.96*se)
) |> 
gt::gt("measure") |> 
gt::fmt_number(rows = 3:8, decimals = 4) |> 
gt::fmt_number(rows = 1:2, decimals = 0) |> 
gtExtras::gt_theme_espn()
value
iterations 11,000
burn-in 1,000
mean of h 0.7730
iid se 0.0017
varfact 21.0156
true se 0.0076
95% CI lb 0.7581
95% CI ub 0.7879
Code
tibble::tibble(x=1:length(xlist), y=xlist) |> 
  ggplot(aes(x=x,y=y)) + geom_line() +
  labs(title = "Plot of accepted values")

MCMC Metropolis Algo

target density shape
p <- 0.4
mu <- c(-1, 2)
sd <- c(.5, 2)
f <- function(x)
    p     * dnorm(x, mu[1], sd[1]) +
    (1-p) * dnorm(x, mu[2], sd[2])
par(mar=c(10.1, 2, .5, .5), oma=c(0, 1, 0, 0))
curve(f(x), col="red", -4, 8, n=301, las=1)

Metropolis sampling algorithm
propose <- function(x) rnorm(1, x, 4)
step <- function(x, f, q) {
    ## Pick new point
    xp <- propose(x)
    ## Acceptance probability:
    alpha <- min(1, f(xp) / f(x))
    ## Accept new point with probability alpha:
    if (runif(1) < alpha) x <- xp
    ## Returning the point:
    x
}
run <- function(x, f, q, nsteps) {
    res <- matrix(NA, nsteps, length(x))
    for (i in seq_len(nsteps))
        res[i,] <- x <- step(x, f, q)
    drop(res)
}
Metropolis samples from target
res <- run(-10, f, q, 1000)

layout(matrix(c(1, 2), 1, 2), widths=c(4, 1))
par(mar=c(10.1, 1.5, .5, .5), oma=c(0, 1, 0, 0))
plot(res, type="s", xpd=NA, ylab="Parameter", xlab="Sample", las=1)
usr <- par("usr")
xx <- seq(usr[3], usr[4], length=301)
plot(f(xx), xx, type="l", yaxs="i", axes=FALSE, xlab="")

Metropolis samples from target: histogram
par(mar=c(10.1, 1.5, .5, .5), oma=c(0, 1, 0, 0))
hist(res, 50, freq=FALSE, main="", ylim=c(0, .4), las=1,
     xlab="x", ylab="Probability density")
z <- integrate(f, -Inf, Inf)$value
curve(f(x) / z, add=TRUE, col="red", n=200)

MCMC Metropolis Algo

Discrete case

  • stationary means that if \(X_0\sim\pi\), i.e. \(\mathbb{P}(X_0=i)=\pi(i)\), then \(\mathbb{P}(X_1=j)=\pi(j),\;\forall j\)

  • \(\mathbb{P}(X_1=j)=\sum_{i\in S}\mathbb{P}\left(X_0=i,X_1=j\right)\) which is \(\sum_{i\in S}\mathbb{P}(X_0=i)\mathsf{P}(i,j)\)

  • so \(\pi\) is stationary if \(\sum_{i\in S}\pi(i)\mathsf{P}(i,j)=\pi(j),\;\forall j\)

MCMC Metropolis Algo

Discrete case

let \(q(x,y)=\mathbb{P}(Y_n=y|X_{n-1}=x)\) be the proposal distribution

  • assume \(q\) is symmetric, i.e. \(q(x,y)=q(y,x),\;\forall x,y\)

  • then if \(\alpha(x,y)\) is the acceptance probability from \(x\) to \(y\):

\[ \begin{align*} \alpha(x,y) & =\mathbb{P}(U_{n}<A_{n}|X_{n-1}=x,Y_{n}=y)\\ & =\mathbb{P}\left(U_{n}<\frac{\pi(y)}{\pi(x)}\right)=\min\left[1,\frac{\pi(y)}{\pi(x)}\right] \end{align*} \]

MCMC Metropolis Algo

Discrete case

Then, for \(i,j\) in the state space, with \(i\ne j\)

\[ \begin{align*} \mathsf{P}(i,j)=q(i,j)\alpha(i,j)=q(i,j)\min\left[1,\frac{\pi(j)}{\pi(i)}\right] \end{align*} \]

and, by symmetry of \(q\)

\[ \begin{align*} \pi(i)\mathsf{P}(i,j) & =q(i,j)\min\left(\pi(i),\pi(j)\right)\\ & =q(j,i)\min\left(\pi(j),\pi(i)\right)=\pi(j)\mathsf{P}(j,i) \end{align*} \]

MCMC Metropolis Algo

Discrete case

If \(X_0\sim\pi\), then

\[ \begin{align*} \mathbb{P}(X_{1}=j) & =\sum_{i\in\chi}\mathbb{P}(X_{0}=i)\mathsf{P}(i,j)=\sum_{i\in\chi}\pi(i)\mathsf{P}(i,j)\\ & =\sum_{i\in\chi}\pi(j)\mathsf{P}(j,i)=\pi(j)\sum_{i\in\chi}\mathsf{P}(j,i)\\ & =\pi(j) \end{align*} \]

so \(X_1\sim\pi\) too, and the Markov chain preserves \(\pi\), i.e. \(\pi\) is a stationary distribution.

MCMC Metropolis Algo

example

\(\chi=\mathbb{Z}\), \(\pi(x)=2^{-|x|}/3\), and \(q(x,y)=\frac{1}{2}\) if \(|x-y|=1\), otherwise \(0\).

  • reversible? Yes, it’s a Metropolis algorithm
  • \(\pi\) stationary? Yes, follows from reversibility
  • aperiodic? Yes, since \(\mathsf{P}(x,\{x\})>0\)
  • irreducible? Yes, since \(\pi(x)>0,\forall x\in\chi\)

MCMC Metropolis-Hastings Algo

MCMC algorithms require that the MC chain is reversible, \(\pi\)-stationary, aperiodic, and irreducible.

The Metropolis algorithm uses the symmetry of the proposal distribution to ensure reversibility and irreducibility.

If the proposal distribution was not symmetric we might not have those two properties, e.g. if \(q(x,y) \gg q(y,x)\) then the Metropolis chain would spend too much time at \(y\) and not enough at \(x\), so it accepts fewer moves \(x\rightarrow y\).

MCMC Metropolis-Hastings Algo

If we only require that \(q(x,y)>0\;\textrm{iff}\; q(y,x)>0\), then replace

\[ A_n = \frac{\pi(Y_n)}{\pi(X_{n-1})}\;\mathrm{with}\; A_n = \frac{\pi(Y_n)q(Y_n,X_{n-1})}{\pi(X_{n-1})q(X_{n-1},Y_n)} \]

and the algorithm is still valid even if \(q\) is not symmetric.

MCMC Metropolis-Hastings Algo

Why is this valid?

For Metropolis, the key was that the Markov chain was reversible, i.e. \(\pi(x)\mathsf{P}(x,y)=\pi(y)\mathsf{P}(y,x)\)

If instead \(A_n = \frac{\pi(Y_n)q(Y_n,X_{n-1})}{\pi(X_{n-1})q(X_{n-1},Y_n)}\), with acceptance probability \(\alpha(x,y)=\min\left[1,\frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}\right]\), then

\[ \begin{align*} q(x,y)\alpha(x,y)\pi(x) & =q(x,y)\min\left[1,\frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}\right]\pi(x)\\ & =\min\left[\pi(x)q(x,y),\pi(y)q(y,x)\right] \end{align*} \]

MCMC Metropolis-Hastings Algo

Then

  • \(\pi(x)\mathsf{P}(x,y)\) is still symmetric, even if \(q\) wasn’t.

  • the chain is still reversible, so there is a stationary distribution \(\pi\).

  • if the chain is irreducible and aperiodic, then it converges to \(\pi\).

MCMC Metropolis-Hastings Algo

example: independence sampler

Here \(\{Y_n\}\sim q(\cdot)\), i.e. the \(\{Y_n\}\) are IID from some fixed density \(q\), independent of \(X_{n-1}\), and we accept \(Y_n\) if \(U_n<A_n\) where \(U_n\sim \textrm{Uniform}[0,1]\) and

\[ A_n = \frac{\pi(Y_n)q(X_{n-1})}{\pi(X_{n-1})q(Y_n)} \]
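
A sketch of an independence sampler in R (illustrative choices, not from the slides: un-normalized Gamma(3,1) target \(g(x)=x^{2}e^{-x}\) and an Exponential(0.5) proposal \(q\)):

independence sampler sketch
set.seed(8740)
g <- function(x) ifelse(x > 0, x^2 * exp(-x), 0)   # un-normalized Gamma(3,1) target
q <- function(x) dexp(x, rate = 0.5)               # fixed proposal density, independent of X_{n-1}

M <- 20000; B <- 2000
x <- 1                                             # starting value
chain <- numeric(M)
for (n in 1:M) {
  y <- rexp(1, rate = 0.5)                         # propose Y_n ~ q, ignoring X_{n-1}
  A <- (g(y) * q(x)) / (g(x) * q(y))               # A_n for the independence sampler
  if (runif(1) < A) x <- y                         # accept, otherwise keep the current state
  chain[n] <- x
}
mean(chain[(B + 1):M])                             # should be near the Gamma(3,1) mean of 3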

MCMC and Bayesian Analysis

The Bayesian model for updating parameter estimates is (to within a scaling constant)

\[ \begin{align*} \pi_\theta\left(\left.\theta\right|X\right)\propto\pi_X\left(\left.X\right|\theta\right)\times\pi_\theta\left(\theta\right) \end{align*} \tag{1}\]

where the parameter set \(\theta\) depends on the assumed data generating process \(\pi_X\).

In words: the joint probability of the parameters given the observed data is proportional to the probability of the observed data given the parameters, times the prior probabilities of the parameters. In practice we refer to these probabilities as likelihoods, and use log-likelihoods in Equation 1 to avoid numerical problems arising from the product of small probabilities.

MCMC and Bayesian Analysis

Initial state: Bayesian parameter estimation, Binomial process
set.seed(8740)
p <- 0.5
n <- 10
# range of probability [0.01 - 0.99]
p_values <- seq(0.01,0.99,0.001)
# prior probability is beta(1,1)
pr <- dbeta(p_values,7,2)
# Have to normalize given discreteness
pr <- pr / sum(pr) 
# create the data
dat <- tibble::tibble(parameters = p_values, prob = pr, x = 0, step= 0) 
8 Bayesian updating steps
# Run for M samples
for (i in 1:8) {
  # have the data generating process generate a data point
  x <- rbinom(1, n, p)
  # multiply the likelihood of observing the data point at each p-value
  # by the prior probability of each p-value:
  # this gives the posterior probability for each p-value
  ps <- dbinom(x, n, p_values) * pr
  # normalize
  ps <- ps / sum(ps) 
  # lines(ps~p_values, col=(i+1))
  # save the posterior p-value probabilities at this step
  dat <- dat |>
    dplyr::bind_rows(
      tibble::tibble(parameters = p_values, prob = ps, x = x) |>
        dplyr::mutate(step = i)
    )
  # update the prior probability of each p-value to be its posterior probability
  pr = ps
}

Plots of Bayesian updating steps
dat <- dat |> dplyr::group_by(step) |>
  dplyr::mutate(
    title =
      dplyr::case_when(
        step == 0 ~ stringr::str_glue("Step {step}: prior mean is {(parameters*prob) |> sum() |> round(digits=3)}")
        , TRUE ~ stringr::str_glue("Step {step}: sample is {x} & posterior mean is {(parameters*prob) |> sum() |> round(digits=3)}")
      )
  )

labels <- dat |> dplyr::distinct(step, title) |> dplyr::mutate(step = factor(step))
step_labels <- split(labels$title, labels$step)
step_labeller <- function(value){return(step_labels[value])}

dat |> dplyr::mutate(step = factor(step)) |> dplyr::group_by(step) |>
  ggplot(aes(x = parameters, y=prob, color = step)) +
  geom_line() +
  geom_vline(xintercept=0.5, color="grey", linewidth=1, linetype = "dashed") +
  facet_wrap(vars(step), nrow = 2, labeller=labeller(step = step_labeller)) +
  theme(plot.margin=unit(c(5,1,5,1), 'cm')) +
  theme_minimal() + 
  labs(title = "Bayesian updating", subtitle = "Binomial data; n known, true p = 0.5")

MCMC and Bayesian Analysis

It is more common to have multiple data points and multiple parameters. In this case we sample from the posterior using \(\pi_X\left(\left.X\right|\theta\right)\times\pi_\theta\left(\theta\right)\) and a MCMC method like Metropolis Hastings.

In the example to follow, we observe \((y,x)\) and we have parameters \(\theta=(a,b,\sigma)\) that we want to estimate

  • the data generation process is assumed to be \(y \sim \mathscr{N}(ax+b, \sigma^2)\)
  • we choose priors on the parameters as follows
    • \(a\sim U[0,10]\)
    • \(b\sim\mathscr{N}(0,5)\)
    • \(\sigma\sim U[0,30]\)

MCMC and Bayesian Analysis

set.seed(8740)
a = 0; b = 5; sigma = 10; sampleSize = 31

data <- 
  tibble::tibble(
    x = (-(sampleSize-1)/2):((sampleSize-1)/2)
    , y =  a + b * x + rnorm(n=sampleSize, mean=0, sd=sigma)
  )
prior = function(param){
  aprior = dunif(param[1], min=0, max=10, log = T)
  bprior = dnorm(param[2], sd = 5, log = T)
  sdprior = dunif(param[3], min=0, max=30, log = T)
  # return the log of the product of probabilities (parameters)
  return(aprior+bprior+sdprior)
}
likelihood = function(param, x, y){
  pred = param[2] + param[1] * x 
  singlelikelihoods = dnorm(y, mean = pred, sd = param[3], log = T)
  sumll = sum(singlelikelihoods)
  # return the log of the product of probabilities (data)
  return(sumll)
}
proposalfunction = function(param){
  return( rnorm(3, mean = param, sd= c(0.1,0.5,0.3)) )
}
run_metropolis_MCMC = function(startvalue, iterations, data){
  # initialize
  chain = array(dim = c(iterations+1,3))
  chain[1,] <- startvalue
  
  for (i in 1:iterations){
    proposal = proposalfunction(chain[i,])
    probab = 
      exp(
        likelihood(proposal, data$x, data$y) + 
        prior(proposal) - 
        likelihood(chain[i,], data$x, data$y) - 
        prior(chain[i,])
      )
    if (runif(1) < probab){
      chain[i+1,] = proposal
    }else{
      chain[i+1,] = chain[i,]
    }
  }
  return(chain)
}

MCMC and Bayesian Analysis

MH algorithm generating posterior parameter probabilities
startvalue = c(4,2,8)
chain = run_metropolis_MCMC(startvalue, 20000, data)

mcm_chain <- coda::mcmc(chain)
smry <- summary(mcm_chain)
smry$statistics |> tibble::as_tibble() |> 
  dplyr::bind_cols(smry$quantiles |> tibble::as_tibble()) |>
  tibble::add_column(param = c('b','a','sigma')) |> 
  gt::gt("param") |> 
  gt::tab_spanner(label = "percentiles", columns = ends_with("%")) |> 
  gt::tab_options( table.font.size = gt::px(48) ) %>% 
  gtExtras::gt_theme_espn() |>
  gt::as_raw_html()
param   Mean     SD      Naive SE   Time-series SE   2.5%     25%      50%     75%     97.5%
b       4.8424   0.1742  0.001232   0.005517          4.498    4.7268   4.844   4.959   5.182
a       0.6129   1.4354  0.010149   0.070569         -2.173   -0.3511   0.616   1.560   3.495
sigma   8.5691   1.1006  0.007783   0.068802          6.726    7.7719   8.496   9.256  10.982

MCMC and Bayesian Analysis

posterior parameter probability plots
plot(mcm_chain)

MCMC and Bayesian Analysis


Recap

  • This week we’ve introduced Monte Carlo (MC) methods for sampling from probability distributions, including using those samples to estimate expected values.
  • We discussed importance sampling, rejection sampling, and Markov Chain Monte Carlo (MCMC) methods, particularly the Metropolis and Metropolis-Hastings methods.
  • Finally, we applied the Metropolis-Hastings method in Bayesian analysis, generating Bayesian estimates for the parameters of a linear regression.