Assignment 2 of Advanced Computational Method

Homework 3

The Metropolis–Hastings algorithm works by generating a sequence of sample values in such a way that, as more and more sample values are produced, the distribution of values more closely approximates the desired distribution, $P(x)$. These sample values are produced iteratively, with the distribution of the next sample being dependent only on the current sample value (thus making the sequence of samples into a Markov chain). Specifically, at each iteration, the algorithm picks a candidate for the next sample value based on the current sample value. Then, with some probability, the candidate is either accepted (in which case the candidate value is used in the next iteration) or rejected (in which case the candidate value is discarded, and the current value is reused in the next iteration). The probability of acceptance is determined by comparing the values of a function $f(x)$, proportional to the density of the desired distribution $P(x)$, at the current and candidate sample values.

Problem

Draw samples from the distribution $p(x, y) = \mathrm{e}^{-(x-1)^2-100(y-x)^2}$

  1. use RW-MH with different step sizes and compare the performance;
  2. use IS with different proposals of your choice and compare the performance.
  3. use adaptive MCMC and compare the performance with standard RW-MH.

For the purpose of implementation, the Metropolis algorithm, a special case of the Metropolis–Hastings algorithm where the proposal function is symmetric, is described below.

Metropolis algorithm (symmetric proposal distribution)

Let $f(x)$ be a function that is proportional to the desired probability distribution $P(x)$ (a.k.a. a target distribution).

  1. Initialization: Choose an arbitrary point $x_0$ to be the first sample, and choose an arbitrary probability density $q(x\mid y)$ that suggests a candidate for the next sample value $x$, given the previous sample value $y$. For the Metropolis algorithm, $q$ must be symmetric; in other words, it must satisfy $q(x\mid y)=q(y\mid x)$. A usual choice is to let $q(x\mid y)$ be a Gaussian distribution centered at $y$, so that points closer to $y$ are more likely to be visited next, making the sequence of samples into a random walk. The function $q$ is referred to as the proposal density or jumping distribution.

  2. For each iteration $t$:

  • Generate: Generate a candidate $x'$ for the next sample by drawing from the distribution $q(x'\mid x_{t})$.
  • Calculate: Calculate the acceptance ratio $\alpha = f(x')/f(x_{t})$, which will be used to decide whether to accept or reject the candidate. Because $f$ is proportional to the density of $P$, we have $\alpha = f(x')/f(x_{t}) = P(x')/P(x_{t})$.
  • Accept or Reject:
    • Generate a uniform random number $u$ on $[0,1]$.
    • If $u\leq \alpha$, accept the candidate by setting $x_{t+1}=x'$.
    • If $u>\alpha$, reject the candidate and set $x_{t+1}=x_{t}$ instead.

The Metropolis-Hastings Algorithm (MH)

1. initialize at $X^0$
2. for $i = 0$ to $n-1$
  • sample $y\sim q(\cdot \mid X^i)$
  • sample $u\sim U[0,1]$
  • $a(X^i, y)=\min\lbrace 1, \frac{p(y)}{p(X^{i})}\frac{q(X^i \mid y)}{q(y \mid X^i)}\rbrace$
  • accept/reject the proposed $y$ according to the probability $a(X^i, y)$:
  • if $u\leq a$, set $X^{i+1} = y$
    else $X^{i+1} = X^{i}$
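
To make the listing concrete, here is a minimal generic sketch of one MH step in R. The names mhStep, p, rq and dq are illustrative only and are not used in the assignment code below: p is the unnormalized target density, rq(x) draws a proposal given the current state, and dq(a, b) evaluates the proposal density $q(a\mid b)$.

# One generic Metropolis-Hastings step (a sketch)
mhStep <- function(x, p, rq, dq) {
  y <- rq(x)                                    # propose y ~ q(. | x)
  a <- min(1, p(y)/p(x) * dq(x, y)/dq(y, x))    # acceptance probability a(x, y)
  if (runif(1) <= a) y else x                   # accept the proposal or keep x
}

For a symmetric proposal, dq(x, y) = dq(y, x), so the ratio cancels and the step reduces to the Metropolis algorithm described above.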

1.

First, let’s take a look at the target distribution. Here, we present a 3D plot in R:

x <- seq(-2, 2, length=100)
y <- x
f <- function(x, y){exp(-(x-1)^2 - 100*(y - x)^2)}
z <- outer(x, y, f)
persp(x, y, z, shade=0.1, col="white", ticktype="detailed", xlab="x", ylab="y", zlab="rosenbrock")

The surface shown above is banana-shaped, as expected.

Next, we follow the Metropolis–Hastings algorithm and give a simple implementation.

# Target distribution (unnormalized): the Rosenbrock-type density
library(MASS)  # for mvrnorm

rosenbrock <- function(x)  # x is a vector (x[1], x[2])
{
  exp(-(x[1]-1)^2 - 100*(x[2] - x[1])^2)
}

# Random-walk Metropolis sampler
rwMetro <- function(target, N, x, step, burnin=1000)
{
  cov <- step*diag(2)                          # proposal covariance: step * I
  samples <- x
  for (i in 2:(burnin+N))
  {
    prop <- mvrnorm(n = 1, x, cov)             # propose a move centered at the current state
    if (runif(1) < min(1, target(prop)/target(x)))
      x <- prop                                # accept; otherwise keep the current state
    samples <- rbind(samples, x)
  }
  samples[(burnin+1):(N+burnin), ]             # drop the burn-in period
}

The code above defines the Rosenbrock-type target for general use and constructs a sampling function with the relevant arguments. For example, a burn-in period is typically necessary: an initial number of samples (by default the first $1000$) is thrown away, since the initial samples may follow a very different distribution, especially if the starting point lies in a region of low density.
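
As a quick visual check of the burn-in effect, here is a minimal sketch; the starting point $(-2,-2)$ and the run length are illustrative only.

# Run without burn-in from a low-density starting point and inspect the traces
raw <- rwMetro(rosenbrock, 2000, c(-2, -2), 1, burnin = 0)
matplot(raw, type = "l", lty = 1, col = c("red", "blue"),
        xlab = "iteration", ylab = "x (red), y (blue)")

Draws near the beginning typically drift from the starting point toward the high-density region; this is exactly the portion the burn-in discards.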

Hence, we can draw samples by calling the function.

rwMetro.samples <- rwMetro(rosenbrock, 10000, c(0,0), 1)

Here, rwMetro.samples contains $10000$ rows of samples with $2$ columns. Next, we want to evaluate the performance of this sampling method. Autocorrelation is a key criterion for choosing the step size. For example, here we set the step size $\delta$ to $1$ and plot the autocorrelation function below:

acf(rwMetro.samples, lag.max=100)

As shown in the picture, the autocorrelation (the diagonal panels) decays within the first $100$ lags[^size], which suggests that the Markov chain mixes and eventually converges to the desired distribution.

[^size]: here we set $100$ to be the default number of effectively independent samples.

By simply changing the step parameter in the call, we obtain different autocorrelation functions; a faster decay implies more nearly independent samples.

Based on this criterion, we can repeat the experiment and compare step sizes in RStudio. The optimal step size turns out to be quite near $1$[^exp].

[^exp]: the result is not quite stable, but we find $\delta\in[0.9,1]$ is adequate.
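
A sketch of such a comparison is given below; the candidate step sizes are illustrative, and the coda package (with its effectiveSize()) is assumed to be available as a measure of effectively independent samples.

library(coda)

for (s in c(0.1, 0.5, 1, 2)) {
  chain <- rwMetro(rosenbrock, 10000, c(0, 0), s)
  acc <- mean(rowSums(abs(diff(chain))) > 0)   # fraction of accepted moves
  ess <- effectiveSize(as.mcmc(chain))         # effective sample size per coordinate
  cat("step =", s, " acceptance =", round(acc, 3), " ESS =", round(ess, 1), "\n")
}

Larger step sizes lower the acceptance rate, while very small steps accept almost everything but move slowly; the effective sample size balances the two.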

2.

However, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages. For example, the samples are correlated: if we want a set of (nearly) independent samples, we have to throw away the majority of them, e.g. by thinning the chain as sketched below.
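
A minimal thinning sketch; the thinning interval of $20$ is hypothetical and would in practice be chosen from the lag at which the autocorrelation dies out.

# Keep every 20th draw to reduce correlation between retained samples
thinned <- rwMetro.samples[seq(1, nrow(rwMetro.samples), by = 20), ]
acf(thinned, lag.max = 20)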

Hence, an independence sampler is useful in some practical settings. The idea is as follows:

Suppose we could propose directly from the target, i.e. set

$$
\begin{split}
& q(y\mid X) = p(y) \\
& a(X, y) = \frac{p(y)}{p(X)}\,\frac{q(X\mid y)}{q(y\mid X)} = \frac{p(y)}{p(X)}\,\frac{p(X)}{p(y)} = 1 \\
\end{split}
$$

so every candidate would be accepted. This suggests choosing

$$
q(y\mid X) = q(y) \approx p(y),
$$

that is, $q(y)$ is a fixed distribution that does not depend on $X$ and approximates the target.

Thus, we construct a global proposal $q(y)$ that is independent of $X$. In this case, the acceptance ratio becomes

$$
a(X^i, y) = \min\left\lbrace 1,\ \frac{p(y)}{p(X^i)}\,\frac{q(X^i)}{q(y)} \right\rbrace
$$

Noting that the Rosenbrock-type density peaks at $(1,1)$, we can propose from a normal distribution centered at $(1,1)$, i.e. let $q(y)$ be the PDF of $N\left(\begin{pmatrix} 1\\ 1 \end{pmatrix}, \delta^2 I\right)$.

Thus, we modify the RW-MH function and give an independence sampler named isSampler.

# Independence sampler: the proposal N((1,1), step*I) does not depend on the current state
isSampler <- function(target, N, x, step, burnin=1000)
{
  cov <- step*diag(2)
  # proposal density q(.); the covariance is diagonal, so q is a product of univariate normals
  q <- function(v) prod(dnorm(v, mean = c(1, 1), sd = sqrt(step)))
  samples <- x
  for (i in 2:(burnin+N))
  {
    prop <- mvrnorm(n = 1, c(1, 1), cov)
    # acceptance ratio p(y)q(x) / (p(x)q(y)), as derived above
    if (runif(1) < min(1, target(prop)*q(x) / (target(x)*q(prop))))
      x <- prop
    samples <- rbind(samples, x)
  }
  samples[(burnin+1):(N+burnin), ]
}

Calling the new function directly, we obtain a set of samples and plot the autocorrelation function.

is.samples <- isSampler(rosenbrock, 10000, c(0,0), 1)

acf(is.samples, lag.max=100)

Similarly, we can change the step parameter to try different proposals; a sketch of such a comparison is given below. After repeated runs, the best-performing proposal was found to be roughly $N\left(\begin{pmatrix} 1\\ 1 \end{pmatrix}, 0.1\, I\right)$, and the corresponding autocorrelation function is shown in the figure below.
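
A sketch of the proposal comparison; the candidate variances are illustrative, and the acceptance rate is estimated by counting the accepted moves.

for (s in c(0.05, 0.1, 0.5, 1)) {
  chain <- isSampler(rosenbrock, 10000, c(0, 0), s)
  acc <- mean(rowSums(abs(diff(chain))) > 0)   # fraction of accepted proposals
  cat("proposal variance =", s, " acceptance rate =", round(acc, 3), "\n")
}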

3.

Adaptive MCMC Algorithm

  1. draw a set of $N$ samples from the target distribution with the standard RW-MH method.
  2. estimate the sample covariance of the samples.
  3. update the covariance matrix[^cov] of the proposal with the computed sample covariance.
  4. draw a new set of samples using the updated proposal.

[^cov]: the updated covariance matrix needs to be positive definite.

Here, the proposed move is $y = X^i + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$, where $\Sigma$ is the updated covariance matrix:

$$
\Sigma = \mathrm{Cov}(X) = \frac{1}{N}\sum_{i=1}^{N}(X^i - \bar X^N)(X^i - \bar X^N)^\top + \delta I \quad,\quad \bar X^N = \frac{1}{N}\sum_{i=1}^{N} X^i
$$

Thus, we can derive the following adaptive MCMC sampler:

adaptiveMCMC <- function(target, N, x, step, burnin=1000)
{
  propCov <- step*diag(2)                      # initial proposal covariance
  samples <- x
  for (i in 2:(burnin+N))
  {
    prop <- mvrnorm(n = 1, x, propCov)
    if (runif(1) < min(1, target(prop)/target(x)))
      x <- prop
    samples <- rbind(samples, x)
    # adapt at every iteration: regularized sample covariance of all draws so far
    propCov <- cov(samples) + step*diag(2)
  }
  samples[(burnin+1):(N+burnin), ]
}
adaptive.samples <- adaptiveMCMC(rosenbrock, 10000, c(0,0), 1)

acf(adaptive.samples, lag.max=100)

After repeated trials, the optimal step size for adaptiveMCMC was found to be around $\delta=0.5$.

The autocorrelations decay rapidly within about $60$ lags and stay near zero after that.

Compared with the standard RW-MH method, the adaptive MCMC method shows an obvious improvement. We also find that the optimal result is more stable for adaptiveMCMC than for rwMetro; a sketch of a quantitative comparison follows.
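
A sketch of a quantitative comparison of the three chains, again assuming the coda package and its effectiveSize() as the common yardstick:

library(coda)

sapply(list(rwMetro = rwMetro.samples,
            isSampler = is.samples,
            adaptive = adaptive.samples),
       function(chain) effectiveSize(as.mcmc(chain)))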

Finally, let’s have a look at samples generated by these three methods.

# Display the samples
par(mfrow=c(2, 2))

plot(rwMetro.samples[,1], rwMetro.samples[,2], xlim=c(-1.5,1.5),ylim=c(-1.5,1.5), main='MH Samples',xlab='x', ylab='y', pch='.')

plot(is.samples[,1], is.samples[,2], xlim=c(-1.5,1.5),ylim=c(-1.5,1.5), main='IS Samples',xlab='x', ylab='y', pch='.')

plot(adaptive.samples[,1], adaptive.samples[,2], xlim=c(-1.5,1.5),ylim=c(-1.5,1.5), main='Adaptive MCMC Samples',xlab='x', ylab='y', pch='.')

Compared with the target distribution, we find that the adaptiveMCMC samples reproduce the target best and rwMetro performs reasonably well, while isSampler, in a way, tends to produce extreme samples.

Furthermore, we can test our optimal result using the generated samples. For example, the samples generated by the adaptiveMCMC method can be used to plot density curves and compare them with our optimal proposal $q(y)\sim N(1, 0.5)$.

x <- adaptive.samples[,1]
y <- adaptive.samples[,2]

par(mfrow=c(2, 2))

# marginal of x: histogram, kernel density (red) vs the proposal N(1, 0.5) (blue)
hist(x, xlab="x", main="Distribution of Samples")

plot(density(x), col="red", lwd=1.5)
curve(dnorm(x, mean=1, sd=sqrt(0.5)), add=TRUE, col="blue", lwd=1.5)

# marginal of y: histogram, kernel density (red) vs the proposal N(1, 0.5) (blue)
hist(y, xlab="y", main="Distribution of Samples")

plot(density(y), col="red", lwd=1.5)
curve(dnorm(x, mean=1, sd=sqrt(0.5)), add=TRUE, col="blue", lwd=1.5)

It can be seen that the proposed distribution fits well, which makes our result more persuasive.

Conclusion

Most simple rejection sampling methods suffer from the “curse of dimensionality”, where the probability of rejection increases exponentially as a function of the number of dimensions. But Metropolis–Hastings, along with other MCMC methods, does not have this problem to the same degree; thus these methods are often the only solutions available when the number of dimensions of the distribution to be sampled is high.

In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding the right proposal distribution to use can be difficult, as the different individual dimensions behave in very different ways, and the step size must be “just right” for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others, rather than choosing a sample for all dimensions at once. Various algorithms can be used to choose these individual samples, depending on the exact form of the multivariate distribution, such as the adaptive rejection Metropolis sampling algorithm.
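
For the target used in this assignment, the full conditionals happen to be Gaussian, obtained by completing the square in the exponent: $y\mid x \sim N(x, 1/200)$ and $x\mid y \sim N\big((1+100y)/101,\ 1/202\big)$. A minimal Gibbs-style sketch follows; the function name and run lengths are illustrative only.

gibbsSampler <- function(N, x0 = c(0, 0), burnin = 1000)
{
  x <- x0[1]; y <- x0[2]
  samples <- matrix(NA, nrow = burnin + N, ncol = 2)
  for (i in 1:(burnin + N))
  {
    x <- rnorm(1, mean = (1 + 100*y)/101, sd = sqrt(1/202))  # draw x | y
    y <- rnorm(1, mean = x, sd = sqrt(1/200))                # draw y | x
    samples[i, ] <- c(x, y)
  }
  samples[(burnin + 1):(burnin + N), ]
}

gibbs.samples <- gibbsSampler(10000)
acf(gibbs.samples, lag.max = 100)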

References

  1. Christian P. Robert, The Metropolis–Hastings Algorithm, Université Paris-Dauphine PSL & University of Warwick.
  2. https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Step-by-step_instructions
  3. http://blog.csdn.net/abcjennifer/article/details/25908495
  4. https://theoreticalecology.wordpress.com/2010/09/17/metropolis-hastings-mcmc-in-r/

Kai Liu, wrote on 2017/11/13

Student ID: 116071910015

Mail: liu1995@sjtu.edu.cn
