Converting emcee objects to DataTree#

DataTree is the data format ArviZ relies on.

This page covers multiple ways to generate a DataTree from emcee objects.

See also

  • Conversion from Python, numpy or pandas objects

  • xarray_for_arviz for an overview of DataTree and its role within ArviZ.

  • schema describes the structure of DataTree objects and the assumptions made by ArviZ to ease your exploratory analysis of Bayesian models.

We will start by importing the required packages and defining the model: the famous eight schools model.

import arviz_base as az
import numpy as np
import emcee
J = 8
y_obs = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])
def log_prior_8school(theta):
    mu, tau, eta = theta[0], theta[1], theta[2:]
    # Half-cauchy prior, hwhm=25
    if tau < 0:
        return -np.inf
    prior_tau = -np.log(tau**2 + 25**2)
    prior_mu = -((mu / 10) ** 2)  # normal prior, loc=0, scale=10
    prior_eta = -np.sum(eta**2)  # normal prior, loc=0, scale=1
    return prior_mu + prior_tau + prior_eta


def log_likelihood_8school(theta, y, s):
    mu, tau, eta = theta[0], theta[1], theta[2:]
    return -(((mu + tau * eta - y) / s) ** 2)


def lnprob_8school(theta, y, s):
    prior = log_prior_8school(theta)
    like_vect = log_likelihood_8school(theta, y, s)
    like = np.sum(like_vect)
    return like + prior
nwalkers = 40  # called chains in ArviZ
ndim = J + 2
draws = 1500
pos = np.random.normal(size=(nwalkers, ndim))
pos[:, 1] = np.absolute(pos[:, 1])
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob_8school, args=(y_obs, sigma))
sampler.run_mcmc(pos, draws);
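As a quick sanity check of the model functions above, the log prior can be evaluated in isolation at a couple of points. This standalone sketch restates the prior from the cell above so it can run on its own; note how a negative tau is rejected with -inf, which is what keeps the sampler inside the support of the half-Cauchy:

```python
import numpy as np


def log_prior_8school(theta):
    # Same prior as above: half-Cauchy(25) on tau, N(0, 10) on mu, N(0, 1) on eta
    mu, tau, eta = theta[0], theta[1], theta[2:]
    if tau < 0:
        return -np.inf
    prior_tau = -np.log(tau**2 + 25**2)
    prior_mu = -((mu / 10) ** 2)
    prior_eta = -np.sum(eta**2)
    return prior_mu + prior_tau + prior_eta


theta = np.zeros(10)                  # mu = tau = 0, eta = 0
print(log_prior_8school(theta))       # -log(625), only the tau term contributes

theta_neg = theta.copy()
theta_neg[1] = -1.0                   # negative tau is outside the support
print(log_prior_8school(theta_neg))   # -inf: the walker proposal is rejected
```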

Manually set variable names#

This first example shows how to convert while manually setting only the variable names, leaving everything else to ArviZ defaults.

# define variable names; they cannot be inferred from emcee
var_names = ["mu", "tau"] + ["eta{}".format(i) for i in range(J)]
idata1 = az.from_emcee(sampler, var_names=var_names)
idata1

ArviZ has stored the posterior variables with the provided names as expected, and it has also included other useful information in the DataTree. The log probability of each sample is stored in the sample_stats group under the name lp, and all the arguments passed to the sampler as args have been saved in the observed_data group.

It can also be useful to apply a burn-in cut to the MCMC samples (see DataTree.sel for more details):

#idata1.sel(draw=slice(100, None))
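The label-based slicing used here can be sketched on a plain xarray.Dataset with a draw coordinate. This is a minimal standalone illustration (toy data, not the sampler output): slice(100, None) selects by coordinate label, dropping the first 100 draws of every chain:

```python
import numpy as np
import xarray as xr

# Toy "posterior" with 4 chains and 1500 draws
posterior = xr.Dataset(
    {"mu": (("chain", "draw"), np.random.normal(size=(4, 1500)))},
    coords={"chain": np.arange(4), "draw": np.arange(1500)},
)

# Drop the first 100 draws as burn-in; selection is by label, not position
trimmed = posterior.sel(draw=slice(100, None))
print(trimmed.sizes["draw"])       # 1400
print(int(trimmed["draw"][0]))     # 100, the first draw kept
```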

From a DataTree, ArviZ's native data structure, the posterior plot of a few variables can be done in one line:

#az.plot_posterior(idata1, var_names=["mu", "tau", "eta4"])

Structuring the posterior as multidimensional variables#

This way of calling from_emcee stores each eta as a separate variable, named eta#. However, they are in fact different dimensions of the same variable, as can be seen in the code of the likelihood and prior functions, where theta is unpacked as:

mu, tau, eta = theta[0], theta[1], theta[2:]

ArviZ has support for multidimensional variables, and there is a way to tell it how to split the variables like it was done in the likelihood and prior functions:

idata2 = az.from_emcee(sampler, slices=[0, 1, slice(2, None)])
idata2
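What the slices argument does can be sketched with plain numpy indexing: each entry of slices is applied to the flat parameter vector to carve out one variable. This is a standalone illustration of the indexing, not part of the converter itself:

```python
import numpy as np

theta = np.arange(10.0)               # flat parameter vector, ndim = J + 2
slices = [0, 1, slice(2, None)]       # same slices passed to from_emcee
mu, tau, eta = (theta[s] for s in slices)

print(mu, tau)       # 0.0 1.0  (scalars)
print(eta.shape)     # (8,)     (one entry per school)
```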

After checking the default variable names, the trace of one dimension of eta can be plotted using ArviZ syntax:

#az.plot_trace(idata2, var_names=["var_2"], coords={"var_2_dim_0": 4});

blobs: unlock sample stats, posterior predictive and miscellanea#

Emcee does not store per-draw sample stats; however, it has a feature called blobs that allows storing arbitrary variables on a per-draw basis. It can be used to store sample_stats or even posterior_predictive data.

You can modify the probability function to use this blobs functionality and store the pointwise log likelihood, then rerun the sampler using the new function:

def lnprob_8school_blobs(theta, y, s):
    prior = log_prior_8school(theta)
    like_vect = log_likelihood_8school(theta, y, s)
    like = np.sum(like_vect)
    return like + prior, like_vect


sampler_blobs = emcee.EnsembleSampler(
    nwalkers,
    ndim,
    lnprob_8school_blobs,
    args=(y_obs, sigma),
)
sampler_blobs.run_mcmc(pos, draws);

You can now use the blob_names argument to indicate how to store this blob-defined variable. As no group is specified, it will go to sample_stats. Note that blob_names is used in addition to the arguments covered in the previous examples; we also introduce the coords and dims arguments to show the power and flexibility of the converter. For more on coords and dims see page_in_construction.

dims = {"eta": ["school"], "log_likelihood": ["school"]}
idata3 = az.from_emcee(
    sampler_blobs,
    var_names=["mu", "tau", "eta"],
    slices=[0, 1, slice(2, None)],
    blob_names=["log_likelihood"],
    dims=dims,
    coords={"school": range(8)},
)
idata3
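To illustrate what the dims and coords arguments produce, here is a hedged xarray sketch (toy data, independent of the converter): each variable listed in dims gets a named school dimension, carrying the coordinate values given in coords:

```python
import numpy as np
import xarray as xr

# Shape (chain, draw, school), as the converter reshapes emcee samples
samples = np.random.normal(size=(40, 1500, 8))

eta = xr.DataArray(
    samples,
    dims=("chain", "draw", "school"),
    coords={"school": range(8)},
    name="eta",
)
print(eta.dims)              # ('chain', 'draw', 'school')
print(eta["school"].values)  # [0 1 2 3 4 5 6 7]
```

With a named dimension in place, later selections such as coords={"school": 4} can address a single school by label instead of by position.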

Multi-group blobs#

You might even have more complicated blobs, each corresponding to a different group of the DataTree. Moreover, you can store the variables passed to the EnsembleSampler via the args argument in the observed_data or constant_data groups. This is shown in the example below:

sampler_blobs.blobs[0, 1]
array([-3.32784122e+00, -5.76423166e-01, -4.26615268e-02, -3.53483214e-01,
       -2.50328863e-02, -2.75246231e-03, -3.12989363e+00, -4.08660379e-01])
def lnprob_8school_blobs(theta, y, sigma):
    mu, tau, eta = theta[0], theta[1], theta[2:]
    prior = log_prior_8school(theta)
    like_vect = log_likelihood_8school(theta, y, sigma)
    like = np.sum(like_vect)
    # store pointwise log likelihood, useful for model comparison with az.loo or az.waic
    # and posterior predictive samples as blobs
    return like + prior, (like_vect, np.random.normal((mu + tau * eta), sigma))


sampler_blobs = emcee.EnsembleSampler(
    nwalkers,
    ndim,
    lnprob_8school_blobs,
    args=(y_obs, sigma),
)
sampler_blobs.run_mcmc(pos, draws);

dims = {"eta": ["school"], "log_likelihood": ["school"], "y": ["school"]}
idata4 = az.from_emcee(
    sampler_blobs,
    var_names=["mu", "tau", "eta"],
    slices=[0, 1, slice(2, None)],
    arg_names=["y", "sigma"],
    arg_groups=["observed_data", "constant_data"],
    blob_names=["log_likelihood", "y"],
    blob_groups=["log_likelihood", "posterior_predictive"],
    dims=dims,
    coords={"school": range(8)},
)
idata4

This last version, which contains both observed data and posterior predictive samples, can be used to plot posterior predictive checks:

#az.plot_ppc(idata4, var_names=["y"], alpha=0.3, num_pp_samples=200);
%load_ext watermark
%watermark -n -u -v -iv -w
Last updated: Fri Jun 16 2023

Python implementation: CPython
Python version       : 3.10.11
IPython version      : 8.14.0

arviz_base: 0.1
numpy     : 1.24.3
emcee     : 3.1.4

Watermark: 2.3.1