DataTree for Exploratory Analysis of Bayesian Models#

Here we present a collection of common manipulations you can use while working with datatree.DataTree objects.

import arviz_base as az
from datatree import DataTree
import numpy as np
import xarray as xr

xr.set_options(display_expand_data=False, display_expand_attrs=False);

display_expand_data=False makes the default view for xarray.DataArray fold the data values to a single line. To explore the values, click on the icon on the left of the view, right under the xarray.DataArray text. It has no effect on Dataset objects, which already default to folded views.

display_expand_attrs=False folds the attributes in both DataArray and Dataset objects to keep the views shorter. In this page we print DataArrays and Datasets several times and they always have the same attributes.

idata = az.load_arviz_data("centered_eight")
idata
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    *empty*

Get a specific group#

post = idata["posterior"]
post
<xarray.DatasetView>
Dimensions:  (chain: 4, draw: 500, school: 8)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'
Data variables:
    mu       (chain, draw) float64 ...
    theta    (chain, draw, school) float64 ...
    tau      (chain, draw) float64 ...
Attributes: (6)

Tip

You’ll have noticed we stored the posterior group in a new variable: post. Since .copy() was not called, post is a view of the same underlying object, so using idata["posterior"] or post is now equivalent.

Use this to keep your code short yet easy to read. Store the groups you’ll use most often as separate variables, but don’t delete the parent DataTree: many ArviZ functions need it to work properly. For example, plot_pair needs data from the sample_stats group to show divergences, compare needs data from both the log_likelihood and posterior groups, and plot_loo_pit needs not 2 but 3 groups: log_likelihood, posterior_predictive and posterior.

Add a new variable#

post["log_tau"] = np.log(post["tau"])
idata.posterior
<xarray.DatasetView>
Dimensions:  (chain: 4, draw: 500, school: 8)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'
Data variables:
    mu       (chain, draw) float64 ...
    theta    (chain, draw, school) float64 ...
    tau      (chain, draw) float64 4.726 3.909 4.844 1.857 ... 2.741 2.932 4.461
    log_tau  (chain, draw) float64 1.553 1.363 1.578 ... 1.008 1.076 1.495
Attributes: (6)

Combine chains and draws#

stacked = az.extract(idata)
stacked
<xarray.Dataset>
Dimensions:  (sample: 2000, school: 8)
Coordinates:
  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'
  * sample   (sample) object MultiIndex
  * chain    (sample) int64 0 0 0 0 0 0 0 0 0 0 0 0 ... 3 3 3 3 3 3 3 3 3 3 3 3
  * draw     (sample) int64 0 1 2 3 4 5 6 7 ... 492 493 494 495 496 497 498 499
Data variables:
    mu       (sample) float64 7.872 3.385 9.1 7.304 ... 1.859 1.767 3.486 3.404
    theta    (school, sample) float64 12.32 11.29 5.709 ... -2.623 8.452 1.295
    tau      (sample) float64 4.726 3.909 4.844 1.857 ... 2.741 2.932 4.461
    log_tau  (sample) float64 1.553 1.363 1.578 0.6188 ... 1.008 1.076 1.495
Attributes: (6)

arviz.extract is a convenience function aimed at taking care of the most common subsetting operations with MCMC samples. It can:

  • Combine chains and draws

  • Return a subset of variables (with optional filtering with regular expressions or string matching)

  • Return a subset of samples. By default it returns a random subset, to avoid getting non-representative samples due to bad mixing.

  • Access any group
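The combination step can be sketched with plain xarray on a synthetic dataset (an illustrative stand-in, since loading the real example data requires a download): stacking chain and draw into a single sample dimension with xarray.Dataset.stack produces the same layout that az.extract returned above.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a posterior group: 4 chains x 500 draws
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"mu": (("chain", "draw"), rng.normal(size=(4, 500)))},
    coords={"chain": np.arange(4), "draw": np.arange(500)},
)

# Stack chain and draw into a single "sample" dimension,
# mirroring the combination step shown above
stacked = ds.stack(sample=("chain", "draw"))
print(stacked.sizes["sample"])  # 4 * 500 = 2000
```

The stacked result keeps chain and draw as coordinates on the new sample dimension, which is why the MultiIndex appears in the output above.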

Get a random subset of the samples#

az.extract(idata, num_samples=100)
<xarray.Dataset>
Dimensions:  (sample: 100, school: 8)
Coordinates:
  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'
  * sample   (sample) object MultiIndex
  * chain    (sample) int64 2 1 2 0 0 2 0 0 0 3 2 0 ... 1 3 3 1 3 2 3 2 3 1 1 0
  * draw     (sample) int64 120 385 192 83 10 97 441 ... 414 44 48 463 120 478
Data variables:
    mu       (sample) float64 8.793 0.4854 7.523 3.48 ... 9.111 2.821 6.28 5.913
    theta    (school, sample) float64 25.74 7.232 6.255 ... 3.351 3.232 -4.656
    tau      (sample) float64 11.36 9.712 7.546 1.185 ... 0.8965 4.987 5.036
    log_tau  (sample) float64 2.43 2.273 2.021 0.1701 ... -0.1093 1.607 1.617
Attributes: (6)

Tip

Use a random seed to get the same subset from multiple groups: az.extract(idata, num_samples=100, rng=3) and az.extract(idata, group="log_likelihood", num_samples=100, rng=3) will continue to have matching samples.

Obtain a NumPy array for a given parameter#

Let’s say we want to get the values for mu as a NumPy array.

stacked.mu.values
array([7.87179637, 3.38455431, 9.10047569, ..., 1.76673325, 3.48611194,
       3.40446391])

Get the dimension lengths#

Let’s check how many groups are in our hierarchical model.

len(idata.observed_data.school)
8

Get coordinate values#

What are the names of the groups in our hierarchical model? You can access them through the school coordinate in this case.

idata.observed_data.school
<xarray.DataArray 'school' (school: 8)>
'Choate' 'Deerfield' 'Phillips Andover' ... "St. Paul's" 'Mt. Hermon'
Coordinates:
  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'

Get a subset of chains#

Let’s keep only chains 0 and 2 here. For the subset to take effect on all relevant DataTree groups (posterior, sample_stats, log_likelihood and posterior_predictive) we will use datatree.DataTree.filter before using .sel.

posterior_groups = {"posterior", "posterior_predictive", "sample_stats", "log_likelihood"}
idata.filter(lambda node: node.name in posterior_groups).sel(chain=[0, 2])
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    *empty*

Remove the first n draws (burn-in)#

Let’s say we want to remove the first 100 draws from all the chains and all DataTree groups with a draw dimension.

idata.filter(lambda node: "draw" in node.dims).sel(draw=slice(100, None))
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    *empty*

If you check the resulting object you will see that the posterior, posterior_predictive, prior and sample_stats groups have 400 draws compared to idata, which has 500. Alternatively, you can specify which group or groups you want to change.

idata.filter(lambda node: node.name in posterior_groups).sel(draw=slice(100, None))
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    *empty*
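Note that .sel is label based and inclusive of both slice endpoints: with draw labels starting at 0, slice(100, None) keeps the draws labelled 100 through 499. A minimal sketch on a synthetic dataset with the same chain and draw sizes as the example:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the same chain/draw sizes as the example
ds = xr.Dataset(
    {"mu": (("chain", "draw"), np.zeros((4, 500)))},
    coords={"chain": np.arange(4), "draw": np.arange(500)},
)

# sel is label based: keeps every draw whose label is >= 100
burnin = ds.sel(draw=slice(100, None))
print(burnin.sizes["draw"])    # 400
print(int(burnin["draw"][0]))  # 100
```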

Compute posterior mean values along draw and chain dimensions#

To compute the mean value of the posterior samples, do the following:

post.mean()
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    mu       float64 4.486
    theta    float64 4.912
    tau      float64 4.124
    log_tau  float64 1.173

This computes the mean along all dimensions. This is probably what you want for mu and tau, which have two dimensions (chain and draw), but maybe not what you expected for theta, which has the additional school dimension.

You can specify along which dimension you want to compute the mean (or other functions).

post.mean(dim=["chain", "draw"])
<xarray.DatasetView>
Dimensions:  (school: 8)
Coordinates:
  * school   (school) object 'Choate' 'Deerfield' ... "St. Paul's" 'Mt. Hermon'
Data variables:
    mu       float64 4.486
    theta    (school) float64 6.46 5.028 3.938 4.872 3.667 3.975 6.581 4.772
    tau      float64 4.124
    log_tau  float64 1.173
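The same dim argument works for other xarray reductions, not just mean. A short sketch on synthetic data (the sizes mirror the example) showing that reducing only the sampling dimensions preserves school:

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(1)
ds = xr.Dataset(
    {"theta": (("chain", "draw", "school"), rng.normal(size=(4, 500, 8)))}
)

# Reduce only the sampling dimensions; "school" survives the reduction
sd = ds.std(dim=["chain", "draw"])
q = ds.quantile([0.05, 0.95], dim=["chain", "draw"])
print(sd["theta"].dims)  # ('school',)
print(q["theta"].dims)   # ('quantile', 'school')
```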

Compute and store posterior pushforward quantities#

We use “posterior pushforward quantities” to refer to quantities that are not variables in the posterior but deterministic computations using posterior variables.

You can use xarray for these pushforward operations and store them as a new variable in the posterior group. You’ll then be able to plot them with ArviZ functions, calculate stats and diagnostics on them (like the mcse) or save and share the inferencedata object with the pushforward quantities included.

Compute the rolling mean of \(\log(\tau)\) with xarray.DataArray.rolling, storing the result in the posterior:

post["mlogtau"] = post["log_tau"].rolling({"draw": 50}).mean()
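With a window of 50 and the default min_periods, the first 49 positions of the rolling mean are NaN. If that is undesirable, min_periods can shorten the initial windows; a small sketch on a synthetic series:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(200, dtype=float), dims="draw")

# Default min_periods equals the window size, so the first
# window - 1 values are NaN; min_periods=1 averages whatever is available
full = da.rolling(draw=50).mean()
partial = da.rolling(draw=50, min_periods=1).mean()
print(bool(full.isnull()[:49].all()))  # True
print(bool(partial.isnull().any()))    # False
```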

Using xarray for pushforward calculations has all the advantages of working with xarray. It also inherits its disadvantages, but we believe those are outweighed by the advantages, and we have already shown how to extract the data as NumPy arrays. Working with InferenceData is mainly working with xarray objects, and this is what is shown in this guide.

Some examples of these advantages are specifying operations with named dimensions instead of positional ones (as seen in some previous sections), automatic alignment and broadcasting of arrays (as we’ll see now), or integration with Dask (as shown in the dask_for_arviz guide).

In this cell you will compute pairwise differences between schools in their mean effects (variable theta). To do so, subtract from the original variable a copy of theta whose school dimension has been renamed. Xarray then broadcasts the two variables because they have different dimensions, and the result is a 4-dimensional variable with all the pairwise differences.

Eventually, store the result in the theta_school_diff variable:

post["theta_school_diff"] = post.theta - post.theta.rename(school="school_bis")

The theta_school_diff variable in the posterior has kept the named dimensions and coordinates:

post
<xarray.DatasetView>
Dimensions:            (chain: 4, draw: 500, school: 8, school_bis: 8)
Coordinates:
  * chain              (chain) int64 0 1 2 3
  * draw               (draw) int64 0 1 2 3 4 5 6 ... 494 495 496 497 498 499
  * school             (school) object 'Choate' 'Deerfield' ... 'Mt. Hermon'
  * school_bis         (school_bis) object 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu                 (chain, draw) float64 7.872 3.385 9.1 ... 3.486 3.404
    theta              (chain, draw, school) float64 12.32 9.905 ... 6.762 1.295
    tau                (chain, draw) float64 4.726 3.909 4.844 ... 2.932 4.461
    log_tau            (chain, draw) float64 1.553 1.363 1.578 ... 1.076 1.495
    mlogtau            (chain, draw) float64 nan nan nan ... 1.494 1.496 1.511
    theta_school_diff  (chain, draw, school, school_bis) float64 0.0 ... 0.0
Attributes: (6)
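The broadcasting behavior can be seen on a minimal example (the school labels and values here are made up for illustration): renaming the dimension prevents alignment, so xarray broadcasts instead and every pairwise combination appears in the result.

```python
import numpy as np
import xarray as xr

theta = xr.DataArray(
    np.array([1.0, 4.0, 9.0]),
    dims="school",
    coords={"school": ["a", "b", "c"]},
)

# Different dimension names -> broadcasting, not alignment:
# the result holds every pairwise difference
diff = theta - theta.rename(school="school_bis")
print(diff.dims)  # ('school', 'school_bis')
print(float(diff.sel(school="b", school_bis="a")))  # 4.0 - 1.0 = 3.0
```

The diagonal of the result is zero, as each school's effect minus itself vanishes.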

Advanced subsetting#

To select the value corresponding to the difference between the Choate and Deerfield schools do:

post["theta_school_diff"].sel(school="Choate", school_bis="Deerfield")
<xarray.DataArray 'theta_school_diff' (chain: 4, draw: 500)>
2.415 2.156 -0.04943 1.228 3.384 9.662 ... -1.656 -0.4021 1.524 -3.372 -6.305
Coordinates:
  * chain       (chain) int64 0 1 2 3
  * draw        (draw) int64 0 1 2 3 4 5 6 7 ... 492 493 494 495 496 497 498 499
    school      <U6 'Choate'
    school_bis  <U9 'Deerfield'

For more advanced subsetting (the equivalent to what is sometimes called “fancy indexing” in NumPy) you need to provide the indices as DataArray objects:

school_idx = xr.DataArray(["Choate", "Hotchkiss", "Mt. Hermon"], dims=["pairwise_school_diff"])
school_bis_idx = xr.DataArray(
    ["Deerfield", "Choate", "Lawrenceville"], dims=["pairwise_school_diff"]
)
post["theta_school_diff"].sel(school=school_idx, school_bis=school_bis_idx)
<xarray.DataArray 'theta_school_diff' (chain: 4, draw: 500,
                                       pairwise_school_diff: 3)>
2.415 -6.741 -1.84 2.156 -3.474 3.784 ... -2.619 6.923 -6.305 1.667 -6.641
Coordinates:
  * chain       (chain) int64 0 1 2 3
  * draw        (draw) int64 0 1 2 3 4 5 6 7 ... 492 493 494 495 496 497 498 499
    school      (pairwise_school_diff) object 'Choate' 'Hotchkiss' 'Mt. Hermon'
    school_bis  (pairwise_school_diff) object 'Deerfield' ... 'Lawrenceville'
Dimensions without coordinates: pairwise_school_diff

Using lists or NumPy arrays instead of DataArrays does column/row based (outer) indexing. As you can see, the result has 9 values of theta_school_diff per sample instead of the 3 pairs of differences we selected in the previous cell:

post["theta_school_diff"].sel(
    school=["Choate", "Hotchkiss", "Mt. Hermon"],
    school_bis=["Deerfield", "Choate", "Lawrenceville"],
)
<xarray.DataArray 'theta_school_diff' (chain: 4, draw: 500, school: 3,
                                       school_bis: 3)>
2.415 0.0 -4.581 -4.326 -6.741 -11.32 ... 1.667 -6.077 -5.203 1.102 -6.641
Coordinates:
  * chain       (chain) int64 0 1 2 3
  * draw        (draw) int64 0 1 2 3 4 5 6 7 ... 492 493 494 495 496 497 498 499
  * school      (school) object 'Choate' 'Hotchkiss' 'Mt. Hermon'
  * school_bis  (school_bis) object 'Deerfield' 'Choate' 'Lawrenceville'

Add new chains using concat#

After checking the mcse and realizing you need more samples, you rerun the model with two chains and obtain an idata_rerun object.

# once implemented
# idata.merge(idata_rerun)
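Until merge is available, one workaround (a sketch on synthetic data; in practice you would concatenate the matching group datasets of idata and idata_rerun) is xarray.concat along the chain dimension, relabeling the rerun's chains so the chain coordinate stays unique:

```python
import numpy as np
import xarray as xr

def make_posterior(n_chains, first_chain=0):
    # Synthetic stand-in for the posterior group of one run
    return xr.Dataset(
        {"mu": (("chain", "draw"), np.zeros((n_chains, 500)))},
        coords={
            "chain": first_chain + np.arange(n_chains),
            "draw": np.arange(500),
        },
    )

post_run1 = make_posterior(4)
# Relabel the rerun's chains to 4 and 5 so coordinates don't clash
post_run2 = make_posterior(2, first_chain=4)

combined = xr.concat([post_run1, post_run2], dim="chain")
print(combined.sizes["chain"])  # 6
```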

Add a new group to a DataTree#

You can also add new groups to a DataTree with the .merge method as above, or by using the parent argument when creating a new DataTree object. The code below creates an example dataset and adds it to the idata DataTree.

rng = np.random.default_rng(3)
ds = az.dict_to_dataset(
    {"obs": rng.normal(size=(4, 500, 2))},
    dims={"obs": ["new_school"]},
    coords={"new_school": ["Essex College", "Moordale"]},
)
DataTree(ds, name="predictions", parent=idata)
idata
<xarray.DatasetView>
Dimensions:  ()
Data variables:
    *empty*