Scalable Computation for Bayesian Hierarchical Models

We want to probe a posterior density. To probe is to marginalise, maximise, or simulate. The important dimensions are \(N\) and \(p\), the sample size and the dimension of the parameter space. Scalable computation means probing to a given precision at a cost that (ideally) scales linearly in these dimensions.

Current Themes in Research

  1. Variable selection
  2. Latent Gaussian Models
  3. Bayesian Hierarchical Models

Structure 1: Crossed-effect Models

\[y_{i_1,\ldots,i_K} \sim \text{N}\!\left(a^{(0)} + a_{i_1}^{(1)} + \cdots + a_{i_K}^{(K)},\ (n_{i_1,\ldots,i_K}\tau_0)^{-1}\right)\]

with

\[p(a^{(0)}) \propto 1, \quad a_j^{(k)} \sim \text{N}(0,1/\tau_k)\]

There could be additional priors on the variance components, and \(i_k = 1,\ldots,I_k\). Also, \(N = \sum_{i_1,\ldots,i_K} n_{i_1,\ldots,i_K}\) and \(p = \sum_k I_k\).

We can view this as a probabilistic recommender system. The special case \(K=2\) is interesting both for intuition and for recommender-type applications: \(y_{ij}\) is the rating of “customer” \(i\) on “product” \(j\), \(a_{i}^{(1)}\) is a customer effect, \(a_j^{(2)}\) is a product effect, and \(n_{ij}\) is 1 if the customer has rated the product and 0 otherwise. Typically, data of this type are very sparse.
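Concretely, instantiating the general model above at \(K=2\) gives

\[y_{ij} \sim \text{N}\!\left(a^{(0)} + a_i^{(1)} + a_j^{(2)},\ (n_{ij}\tau_0)^{-1}\right), \qquad a_i^{(1)} \sim \text{N}(0,1/\tau_1), \quad a_j^{(2)} \sim \text{N}(0,1/\tau_2).\]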

For simplicity, consider a two-level model for the mean of a regression model.

Structure 2: Nested Multilevel Regression Models

This is a regression in which the prior mean is itself a regression. With only one level, this is a dynamic linear model. It is also practically important to allow \(\Sigma_{i_1,\ldots,i_{d-1}}\) to be low-rank (partly deterministic dynamics, non-nested structures).
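As a minimal sketch of what this looks like with two levels (the notation here is an assumption for illustration, not taken from the talk):

\[y_{i_1 i_2} \sim \text{N}\!\left(x_{i_1 i_2}^{\top}\beta_{i_1},\ \sigma^2\right), \qquad \beta_{i_1} \sim \text{N}\!\left(Z_{i_1}\gamma,\ \Sigma_{i_1}\right), \qquad \gamma \sim \text{N}(\mu_0, \Sigma_0),\]

so the group-level coefficients \(\beta_{i_1}\) are themselves given a regression prior with coefficients \(\gamma\).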

MCMC Convergence

A Markov Chain with transition kernel \(P\) and set of eigenvalues \(S\) converges if

\[\lambda^* = \sup\left(S \setminus \{1\}\right) < 1\]

The smaller this value, the better; equivalently, the larger the spectral gap \(1 - \lambda^*\), the better. The relaxation time of a Markov chain is defined to be

\[\frac{1}{1 - \lambda^*}\]

and can be interpreted as the number of iterations by which the chain must be subsampled (thinned) so that the resulting draws are roughly independent of each other.
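This thinning interpretation can be checked empirically. Below is a minimal sketch (mine, not from the talk) that estimates the autocorrelation function and the integrated autocorrelation time of a scalar chain with NumPy; the integrated autocorrelation time is a related but distinct quantity from the relaxation time, and draws thinned at roughly this interval behave approximately like independent samples.

```python
import numpy as np

def autocorrelation(chain, max_lag=200):
    """Empirical autocorrelation function of a scalar Markov chain."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / np.dot(x, x) for k in range(max_lag)])

def integrated_autocorrelation_time(chain, max_lag=200):
    """Rough estimate of the thinning interval needed for near-independent draws."""
    acf = autocorrelation(chain, max_lag)
    # truncate the sum at the first non-positive autocorrelation (simple heuristic)
    cutoff = np.argmax(acf <= 0) if np.any(acf <= 0) else max_lag
    return 1.0 + 2.0 * acf[1:cutoff].sum()

# Example: an AR(1) chain with lambda* = 0.95 has relaxation time 1/(1 - 0.95) = 20.
rng = np.random.default_rng(0)
x = np.zeros(50_000)
for t in range(1, len(x)):
    x[t] = 0.95 * x[t - 1] + rng.standard_normal()
print(integrated_autocorrelation_time(x))  # roughly (1 + 0.95)/(1 - 0.95) ≈ 39
```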

Scalable Gibbs Sampling for Crossed-Effect Models

Omiros studied two implementations, the Gibbs sampler (GS) and the collapsed Gibbs sampler (cGS), with results about the scalability of the resulting MCMC.

Note: Balanced levels mean that every level of every factor appears the same number of times in the data. Balanced cells mean that the data have entries for every combination of levels; in the recommender example, every customer has rated every product.

For unbalanced designs, the results show that the relaxation time of the GS is of order \(\sqrt{N}\) and that of the cGS is of order 1, giving overall complexities of \(N^{3/2}\) and \(N\), respectively. This is a much better efficiency for the collapsed Gibbs sampler, as shown by the effective sample size per second for various parameters. How does the ACF look? The effective sample size can be inflated by negative autocorrelations in the MCMC chain.

The Gibbs sampler for crossed effects uses normal conjugacy to draw from the full conditional distributions. Under the assumption of balanced cells, a linear-algebra collapse can be used to greatly speed things up, reducing the cost per iteration from order \(p^2\) to order \(p\).
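As a concrete illustration (a minimal sketch, not the implementation from the talk), here is a vanilla Gibbs sampler for the \(K=2\) model with known precisions \(\tau_0,\tau_1,\tau_2\), using the conjugate normal full conditionals; the function name, fixed precisions, and array layout are assumptions made for the example.

```python
import numpy as np

def gibbs_crossed_effects(y, n, tau0=1.0, tau1=1.0, tau2=1.0, iters=1000, seed=0):
    """Vanilla Gibbs sampler for y_ij ~ N(a0 + a1_i + a2_j, (n_ij * tau0)^{-1}),
    flat prior on a0, a1_i ~ N(0, 1/tau1), a2_j ~ N(0, 1/tau2).
    y: (I, J) array of ratings (arbitrary finite values where n == 0);
    n: (I, J) 0/1 indicator of which cells are observed."""
    rng = np.random.default_rng(seed)
    I, J = y.shape
    N = n.sum()
    a0, a1, a2 = 0.0, np.zeros(I), np.zeros(J)
    samples = []
    for _ in range(iters):
        # a0 | rest: flat prior, so the mean is the average residual over observed cells
        resid = y - a1[:, None] - a2[None, :]
        a0 = rng.normal((n * resid).sum() / N, 1.0 / np.sqrt(N * tau0))
        # a1_i | rest: conjugate normal update; levels are conditionally independent
        prec1 = tau1 + tau0 * n.sum(axis=1)
        mean1 = tau0 * (n * (y - a0 - a2[None, :])).sum(axis=1) / prec1
        a1 = rng.normal(mean1, 1.0 / np.sqrt(prec1))
        # a2_j | rest: symmetric update for the second factor
        prec2 = tau2 + tau0 * n.sum(axis=0)
        mean2 = tau0 * (n * (y - a0 - a1[:, None])).sum(axis=0) / prec2
        a2 = rng.normal(mean2, 1.0 / np.sqrt(prec2))
        samples.append((a0, a1.copy(), a2.copy()))
    return samples
```

Cells with \(n_{ij}=0\) are simply masked out of every sum, so sparse rating data work directly.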

The Collapsed Gibbs Sampler (cGS)

This might be a good idea because

\[\tilde{y} \approx \sum_k \bar{a}^{(k)},\]

hence the joint posterior concentrates near a manifold of one dimension lower. This means we essentially integrate out the local mean, a similar trick to using contrasts.
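To make the collapse explicit, consider \(K=2\) with balanced cells and one observation per cell (an illustrative derivation, not taken from the talk). Averaging the model over all cells gives

\[\bar{y} = a^{(0)} + \bar{a}^{(1)} + \bar{a}^{(2)} + \bar{\varepsilon}, \qquad \bar{\varepsilon} \sim \text{N}\!\left(0,\ (N\tau_0)^{-1}\right),\]

so for large \(N\) the data pin down the sum of the means very precisely while the individual means remain only weakly identified; single-block Gibbs updates then move slowly along this near-flat ridge, which is the direction the collapsing is meant to handle.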

Myths and Facts about cGS

There is a widespread conviction that collapsing is good. The idea is that doing more maths when possible can only make the algorithm more efficient, and that sampling from lower-dimensional targets has to be more efficient. This is bull.

There is no theory that justifies these claims. Omiros gives an example that runs counter to this idea.

Theorem 1 Let \(\{a(t)\}\) be the Markov chain generated either by the GS or the cGS for balanced-levels designs. Then the timewise transformations \(\{\bar{a}(t)\}\) and \(\{\delta a(t)\}\) obtained from \(\{a(t)\}\)... (running out of time and speeding up)

Theorem 2 For balanced-cells designs, the relaxation time of the GS is \(1 + \max_{k=1,\ldots,K} \frac{N\tau_0}{I_k\tau_k}\) and that of the cGS is 1.
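For a rough sense of scale (illustrative numbers, not from the talk): with \(N = 10^6\) observations, \(I_k = 10^3\) levels per factor, and \(\tau_0 = \tau_k = 1\), the GS relaxation time is \(1 + 10^6/10^3 \approx 10^3\) iterations per effective draw, while the cGS needs only order-one iterations.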

Theorem 3 For balanced-levels designs with \(K=2\), the rate of convergence of the GS... (running out of time)

Basically, the cGS is always at least as fast as the GS and mixes well for both small and large data. Its relaxation time is not uniformly bounded in \(N\) over all designs, but in a typical situation the cGS is scalable. The GS can be scalable, but in a sparse regime it is not.

Scalable Computation of Nested Multilevel Hierarchical Models

This structure is amenable to efficient sampling using belief propagation. The basic idea is that messages encode marginalisations (marginal-likelihood-type calculations), which can massively save time.
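As a minimal sketch of the idea (a toy example of my own, not the talk's implementation): for the simple two-level model \(y_{ij} \sim \text{N}(\beta_i, 1/\lambda_y)\), \(\beta_i \sim \text{N}(\mu, 1/\lambda_\beta)\), flat prior on \(\mu\), the upward messages integrate out each \(\beta_i\) analytically, giving the marginal posterior of \(\mu\); a downward pass then samples the \(\beta_i\) exactly.

```python
import numpy as np

def nested_two_level_sample(groups, lam_y=1.0, lam_b=1.0, seed=0):
    """One exact joint draw of (mu, beta_1..beta_I) for the toy nested model
    y_ij ~ N(beta_i, 1/lam_y), beta_i ~ N(mu, 1/lam_b), flat prior on mu.
    `groups` is a list of 1-d arrays, one per group."""
    rng = np.random.default_rng(seed)
    nbar = np.array([len(g) for g in groups], dtype=float)
    ybar = np.array([g.mean() for g in groups])
    # Upward messages: integrate out beta_i, so ybar_i | mu ~ N(mu, 1/(n_i lam_y) + 1/lam_b)
    kappa = 1.0 / (1.0 / (nbar * lam_y) + 1.0 / lam_b)   # message precisions
    post_prec_mu = kappa.sum()
    post_mean_mu = (kappa * ybar).sum() / post_prec_mu
    mu = rng.normal(post_mean_mu, 1.0 / np.sqrt(post_prec_mu))
    # Downward pass: beta_i | mu, y is a conjugate normal, sampled exactly
    prec_b = nbar * lam_y + lam_b
    mean_b = (nbar * lam_y * ybar + lam_b * mu) / prec_b
    beta = rng.normal(mean_b, 1.0 / np.sqrt(prec_b))
    return mu, beta

# Example usage with three small groups
data = [np.array([1.0, 1.2, 0.8]), np.array([2.1, 1.9]), np.array([0.1, -0.2, 0.0, 0.3])]
mu, beta = nested_two_level_sample(data)
```

Each upward message is exactly a marginal-likelihood calculation for one group, which is what makes the whole pass linear in the number of groups.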

Last Thoughts

There are many linear algebra methods that speed up calculations significantly.