Implementing an MCMC Sampler for Bayesian Inference

18 Jan 2024

Imagine one of the most common problems in statistics: we have two distinct data samples and we want to find out how likely it is that they come from the same population.

More specifically, both samples are series of binary outcomes $d_{ij} \in \{0, 1\}$ generated from two Bernoulli distributions with unknown parameters $\theta_1, \theta_2$. Are the true values of the parameters equal, i.e. $\theta_1 = \theta_2$? If they are, the two groups come from the same population.

We will use Bayesian modeling to answer this question. This approach provides us with a probability distribution $p(\theta_1,\theta_2|D)$ that tells us which parameters most likely generated our observed data $D$. We can compute it as

\[p(\theta_1,\theta_2|D) = \frac{p(D|\theta_1,\theta_2)p(\theta_1,\theta_2)}{p(D)}\]

which is just an application of Bayes’ theorem. The distribution on the left is called the posterior because we get it by using the data $D$ to update our prior beliefs $p(\theta_1, \theta_2)$.

Data

First we create some random data. As you can see in the code snippet, the distributions we use to generate the samples are indeed different, because their parameters are $\theta_1^{\text{true}} = 0.1, \theta_2^{\text{true}} = 0.13$. The Bernoulli distribution parameter is usually called $p$ in NumPy and PyMC. However, I am using $\theta$ for the parameter name to avoid writing confusing notation such as $p(p_1, p_2|D)$.

import numpy as np
import pandas as pd

size = 2000
rng = np.random.default_rng(1)
sample_1 = rng.binomial(n=1, p=0.1, size=size)
sample_2 = rng.binomial(n=1, p=0.13, size=size)

data = pd.DataFrame(
    {
        "y": np.concatenate([sample_1, sample_2]),
        "variant": ["1"] * size + ["2"] * size,
    }
)
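
As a quick sanity check, the empirical positive rates of the two samples land close to the true parameters:

print(sample_1.mean(), sample_2.mean())  # roughly 0.1 and 0.13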

Using PyMC

PyMC is a Python library for Bayesian modeling. The code example below shows the implementation of our model in PyMC.

import pymc as pm

variant_mask, variant_ids = pd.factorize(data["variant"])
model = pm.Model(coords={"group": ["1", "2"]})
with model:

    theta = pm.Uniform("theta", 0, 1, dims="group")
    likelihood = pm.Bernoulli(
        "likelihood", p=theta[variant_mask], observed=data["y"]
    )

The line model = pm.Model(coords={"group": ["1", "2"]}) tells PyMC that there are two distinct groups in the dataset. The first line in the with block defines the prior distribution of the parameters $p(\theta_1, \theta_2)$ as a uniform distribution between 0 and 1. The next line is the likelihood $p(D|\theta_1, \theta_2)$ of our observed data $D$ given some $\theta_1, \theta_2$.

Running the inference as

import time

t_0 = time.time()
with model:
    inference = pm.sample(5000, random_seed=np.random.default_rng(1))
print(f"{time.time() - t_0:.2f} seconds")

produces 4 chains of 5000 random pairs of $\theta_1, \theta_2$ that follow the posterior probability distribution $p(\theta_1, \theta_2|D)$. This means that parameter values that describe the observed data better should appear more frequently in the sampled chains.
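
If you want to peek at what the sampler produced, ArviZ (the library PyMC returns its results in) can summarize the draws; this quick check is not part of the timing comparison below:

import arviz as az

# posterior means, HDIs and convergence diagnostics for theta_1 and theta_2
print(az.summary(inference, var_names=["theta"]))

# the raw draws live in the posterior group with dimensions (chain, draw, group)
print(inference.posterior["theta"].shape)  # e.g. (4, 5000, 2)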

The problem is that it takes around 60 seconds to run the sampling on my laptop. That is way too slow for a model with just 2 parameters and 4000 data points. In the next section, we will implement our own sampler that will hopefully be faster than PyMC.

Metropolis Sampling

Random pairs of parameters drawn from a specific posterior distribution can be generated using MCMC (Markov chain Monte Carlo) methods. One of them is the Metropolis algorithm, which works like this:

1. Start from some current parameters $\theta_1^C, \theta_2^C$.
2. Propose new parameters $\theta_1^P, \theta_2^P$ by randomly perturbing the current ones.
3. Accept the proposal with probability $\min\left(1, \frac{p(\theta_1^P, \theta_2^P|D)}{p(\theta_1^C, \theta_2^C|D)}\right)$ and make it the new current pair; otherwise keep the current parameters. Record the current pair and repeat from step 2.

As you can see, the algorithm requires that we are able to evaluate the posterior probability density function $p(\theta_1, \theta_2|D)$. Because the parameters as well as the data points are independent, we can calculate the density as

\[p(\theta_1,\theta_2|D) = \frac{p(D|\theta_1,\theta_2)p(\theta_1,\theta_2)}{p(D)} = \frac{p(\theta_1, \theta_2)\prod_{i=1}^2\prod_{j=1}^N p(d_{ij}|\theta_1, \theta_2)}{p(D)}\]

where $d_{ij}$ is the $j$-th data point in the $i$-th sample group. What about $p(D)$? We can find its value by integrating over the parameter space as

\[p(D) = \int_0^1 \int_0^1 p(D|\theta'_1, \theta'_2)p(\theta'_1,\theta'_2)\textrm{d}\theta'_1\textrm{d}\theta'_2\]

Fortunately, we are interested only in the ratio $\frac{p(\theta_1^P, \theta_2^P|D)}{p(\theta_1^C, \theta_2^C|D)}$, so we don’t have to calculate $p(D)$ at all because it cancels out:

\[\frac{p(\theta_1^P, \theta_2^P|D)}{p(\theta_1^C, \theta_2^C|D)} = \frac{p(D|\theta_1^P,\theta_2^P)p(\theta_1^P,\theta_2^P)}{p(D|\theta_1^C,\theta_2^C)p(\theta_1^C,\theta_2^C)} = \frac{p(\theta_1^P, \theta_2^P)\prod_{i=1}^2\prod_{j=1}^N p(d_{ij}|\theta_1^P, \theta_2^P)}{ p(\theta_1^C, \theta_2^C)\prod_{i=1}^2\prod_{j=1}^N p(d_{ij}|\theta_1^C, \theta_2^C)}\]

However, there is another complication. To evaluate this ratio, we need to multiply together thousands of numbers between 0 and 1. If we tried this in Python, we would quickly run into floating-point underflow: the likelihood $p(D|\theta_1,\theta_2)$ would end up being zero. You can try it yourself by typing the following snippet into a Python terminal:

p = 1
for _ in range(1000):
    p = p * 0.1
print(p)  # is 0.0

We can work around this issue by applying logarithms to transform the multiplications to additions of negative numbers.
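
The same toy example stays perfectly well behaved in log space:

log_p = 0.0
for _ in range(1000):
    log_p += np.log(0.1)  # add logs instead of multiplying probabilities
print(log_p)  # roughly -2302.6, no underflow

Applying the same idea to the ratio above gives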

\[\frac{p(\theta_1^P, \theta_2^P|D)}{p(\theta_1^C, \theta_2^C|D)} = \exp(\log(p(D|\theta_1^P,\theta_2^P)p(\theta_1^P,\theta_2^P)) - \log(p(D|\theta_1^C,\theta_2^C)p(\theta_1^C,\theta_2^C))) \\ = \exp\left( \log p(\theta_1^P,\theta_2^P) + \sum_{i=1}^2\sum_{j=1}^N \log p(d_{ij}|\theta_1^P, \theta_2^P) - \log p(\theta_1^C,\theta_2^C) - \sum_{i=1}^2\sum_{j=1}^N \log p(d_{ij}|\theta_1^C, \theta_2^C) \right) \\ = \exp(\log p_{\text{likelihood}}^P p_{\text{prior}}^P - \log p_{\text{likelihood}}^C p_{\text{prior}}^C)\]

Because we are dealing with Bernoulli distributions, where the probabilities of a positive and a negative outcome are $\theta$ and $1 - \theta$ respectively, we can write $\log p_{\text{likelihood}} p_{\text{prior}}$ as

\[\log p(\theta_1, \theta_2) + z_1 \log \theta_1 + (N_1 - z_1) \log(1 - \theta_1) + z_2 \log \theta_2 + (N_2 - z_2) \log(1 - \theta_2)\]

where $z_1$ is the count of positive outcomes in group 1 out of a total number of events $N_1$ (and likewise $z_2$, $N_2$ for group 2). Let’s simplify it further. Since the parameters are independent, the prior can be rewritten as

\[\log p(\theta_1, \theta_2) = \log(p(\theta_1)p(\theta_2)) = \log p(\theta_1) + \log p(\theta_2)\]

In our case the prior is a uniform distribution between 0 and 1. Its probability density function is $p(\theta) = \frac{1}{b - a} = \frac{1}{1 - 0} = 1$ for any possible value of $\theta$. This gives us

\[\log p(\theta_1, \theta_2) = \log p(\theta_1) + \log p(\theta_2) = 2\log(1) = 0\]

We can therefore ignore the prior term, since it is always 0. The final formula is

\[\log p_{\text{likelihood}} p_{\text{prior}} = z_1 \log \theta_1 + (N_1 - z_1) \log(1 - \theta_1) + z_2 \log \theta_2 + (N_2 - z_2) \log(1 - \theta_2)\]

Now it’s time to rewrite this in Python. We will have three functions corresponding to the descriptions and formulas above. log_likelihood_prior calculates $\log p_{\text{likelihood}} p_{\text{prior}}$ for either the current or the proposed parameters, based on the data $D$ summarized by $z_1,N_1,z_2,N_2$.

def log_w_eps(x: float) -> float:
    return np.log(x + 1e-20)  # small epsilon so that log(0) stays finite

def log_likelihood_prior(
    theta_1: float, theta_2: float, N_1: int,
    N_2: int, z_1: int, z_2: int,
) -> float:
    
    return (
        (z_1 * log_w_eps(theta_1))
        + ((N_1 - z_1) * log_w_eps(1 - theta_1))
        + (z_2 * log_w_eps(theta_2))
        + ((N_2 - z_2) * log_w_eps(1 - theta_2))
    )

The next function is propose, which generates the proposed parameters $\theta_1^P, \theta_2^P$. The last one is metropolis_sampling, which runs the three-step algorithm described at the beginning of this section and produces a single chain with n_steps pairs of parameters.

def propose(current: np.ndarray) -> np.ndarray:
    # random walk proposal: add Gaussian noise and keep the result inside [0, 1]
    mean = np.array([0, 0])
    covar = np.diag([0.01, 0.01])
    proposed_delta = np.random.multivariate_normal(mean, covar)
    return np.clip(current + proposed_delta, 0, 1)

def metropolis_sampling(
    log_likelihood_prior: callable,
    start: list,
    n_steps: int,
) -> np.ndarray:

    chain = np.empty((n_steps, 2), dtype=np.float64)
    chain[0, :] = start

    for i in range(1, n_steps):  # fill every remaining row of the chain
        current = chain[i - 1]
        proposal = propose(current)
        # acceptance probability: posterior ratio of the proposed vs the current pair
        prob_move = np.exp(
            log_likelihood_prior(*proposal)
            - log_likelihood_prior(*current)
        )

        if np.random.rand() < prob_move:
            chain[i, :] = proposal
        else:
            chain[i, :] = current

    return chain

Inference

Our Metropolis sampler runs significantly faster than the PyMC model from the beginning of this post. It takes around 10 seconds to compute a single chain of 100,000 pairs.

from functools import partial

f = partial(
    log_likelihood_prior,
    N_1=len(sample_1), N_2=len(sample_2),
    z_1=sum(sample_1), z_2=sum(sample_2),
)

t_0 = time.time()
np.random.seed(1)
chain = metropolis_sampling(f, [0.1, 0.1], 100_000)
print(f"{time.time() - t_0:.2f} seconds")

The figure below shows the final results. On the left is the posterior distribution of the difference $\theta_2 - \theta_1$ from the PyMC model. On the right is the same distribution from our Metropolis sampler. Both are very similar, with centers close to the true difference $\theta_2^{\text{true}} - \theta_1^{\text{true}} = 0.03$. The highest density intervals (HDI) are almost identical as well.

[Figure: posterior distributions of $\theta_2 - \theta_1$ from PyMC (left) and the Metropolis sampler (right)]
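
To reproduce the right panel from our chain, one possible sketch looks like this (the burn-in length is an arbitrary choice on my part):

import arviz as az

burn_in = 1_000  # discard the first draws while the chain is still settling in
diff = chain[burn_in:, 1] - chain[burn_in:, 0]  # theta_2 - theta_1

print(diff.mean())                  # lands close to the true difference of 0.03
print(az.hdi(diff, hdi_prob=0.94))  # 94% highest density interval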

It is also trivial to run metropolis_sampling in parallel with the multiprocess package to produce multiple chains at the same time; each worker gets its own seed so that the chains differ.

import multiprocess as mp

def run_chain(seed: int) -> np.ndarray:
    np.random.seed(seed)  # seed each worker separately so the chains are not identical copies
    return metropolis_sampling(f, [0.1, 0.1], 100_000)

with mp.Pool(4) as p:  # 4 chains in parallel
    samples = np.concatenate(p.map(run_chain, range(4)))
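
A convergence check we could add here is the split R-hat statistic, which ArviZ can compute once we restore the chain axis; values close to 1.0 suggest the chains agree with each other:

import arviz as az

# reshape back to (chain, draw, parameter): 4 chains of 100,000 pairs
chains = samples.reshape(4, 100_000, 2)
print(az.rhat(chains))  # R-hat should be close to 1.0 for both parameters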

A notebook with all the code is available on my GitHub.