Simple Complexity

Marginal likelihood is exhaustive leave-p-out cross-validation

By Samuel Belko, published on May 9, 2026.

In this post, I would like to highlight a connection between the log marginal likelihood (LML) and cross-validation. In fact, the LML is the same as an exhaustive leave-p-out cross-validation, averaged across all train-test splits.

The derivation below follows the original proof from the paper On the marginal likelihood and cross-validation.

Consider a statistical model specified by a prior distribution $\pi(\theta)$ and a likelihood $f(y_{1:n} | \theta)$. The marginal likelihood is defined as

$$ p(y_{1:n}) = \int f(y_{1:n} | \theta) \, \pi(\theta) \, d\theta \tag{1} $$

and quantifies the probability of the model generating the data.
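
To make this definition concrete, here is a minimal Julia sketch that evaluates the integral in (1) two ways for a toy conjugate model, chosen purely for illustration: a prior $\theta \sim \mathcal{N}(0, \tau^2)$ with i.i.d. observations $y_i | \theta \sim \mathcal{N}(\theta, \sigma^2)$. The data values are arbitrary; the closed-form marginal and a simple Monte Carlo estimate over the prior should roughly agree.

```julia
# A toy check of the marginal likelihood definition, assuming the conjugate
# Gaussian model θ ~ N(0, τ²), yᵢ | θ ~ N(θ, σ²); data values are arbitrary.
using Distributions, LinearAlgebra, Random, Statistics

Random.seed!(1)
τ, σ = 1.0, 0.5
y = [0.3, -0.1, 0.4]
n = length(y)

# Integrating θ out analytically gives y ~ N(0, τ²·11ᵀ + σ²·I).
lml_analytic = logpdf(MvNormal(zeros(n), fill(τ^2, n, n) + σ^2 * I), y)

# Monte Carlo estimate of ∫ f(y | θ) π(θ) dθ with θ sampled from the prior.
# (A log-sum-exp would be more stable for larger datasets; kept simple here.)
θs = rand(Normal(0, τ), 100_000)
loglik(θ) = sum(logpdf(Normal(θ, σ), yi) for yi in y)
lml_mc = log(mean(exp.(loglik.(θs))))

@show lml_analytic lml_mc   # the two values should roughly agree
```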

By the chain rule of probability, for any permutation $\tilde{y}_{1:n}$ of the entries of $y_{1:n}$, we have

$$ p(y_{1:n}) = \prod_{i = 1}^n p(\tilde{y}_i | \tilde{y}_{1:i-1}). \tag{2} $$

Example:

There are $3!$ permutations of the entries of $y_{1:3} = (y_1, y_2, y_3)$, and each one induces a decomposition:

$$
\begin{aligned}
p(y_1, y_2, y_3) &= p(y_1) \, p(y_2 | y_1) \, p(y_3 | y_1, y_2) \\
&= p(y_1) \, p(y_3 | y_1) \, p(y_2 | y_1, y_3) \\
&= p(y_2) \, p(y_1 | y_2) \, p(y_3 | y_1, y_2) \\
&= p(y_2) \, p(y_3 | y_2) \, p(y_1 | y_2, y_3) \\
&= p(y_3) \, p(y_1 | y_3) \, p(y_2 | y_1, y_3) \\
&= p(y_3) \, p(y_2 | y_3) \, p(y_1 | y_2, y_3)
\end{aligned} \tag{3}
$$

Taking the logarithm in (2), we get

$$ \log p(y_{1:n}) = \sum_{i = 1}^n \log p(\tilde{y}_i | \tilde{y}_{1:i-1}). \tag{4} $$

Since the decomposition holds for any permutation of the entries of $y_{1:n}$, taking the arithmetic mean over all permutations $\{\tilde{y}_{1:n}^{j}\}_j$ yields

$$ \log p(y_{1:n}) = \frac{1}{n!} \sum_{j=1}^{n!} \sum_{i = 1}^n \log p(\tilde{y}_i^j | \tilde{y}_{1:i-1}^j). \tag{5} $$
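
As a sanity check of (2) and (5), the following sketch uses the same toy conjugate Gaussian model as above (an arbitrary modeling choice) and a hypothetical helper `predictive_logpdf` implementing its closed-form posterior predictive: every permutation's chain-rule sum coincides with the LML, and hence so does their arithmetic mean.

```julia
# Numerical check of (2)–(5), assuming the toy conjugate model
# θ ~ N(0, τ²), yᵢ | θ ~ N(θ, σ²); data values are arbitrary.
using Distributions, LinearAlgebra, Combinatorics, Statistics

τ, σ = 1.0, 0.5
y = [0.3, -0.1, 0.4]
n = length(y)

# Closed-form LML: integrating θ out gives y ~ N(0, τ²·11ᵀ + σ²·I).
lml = logpdf(MvNormal(zeros(n), fill(τ^2, n, n) + σ^2 * I), y)

# Closed-form posterior predictive log density log p(y_new | D) of the toy model.
function predictive_logpdf(y_new, D)
    s² = 1 / (1 / τ^2 + length(D) / σ^2)   # posterior variance of θ given D
    μ  = s² * sum(D) / σ^2                 # posterior mean of θ given D
    logpdf(Normal(μ, sqrt(s² + σ^2)), y_new)
end

# One chain-rule sum per permutation, as in (4); all 3! = 6 values coincide.
chain_rule_sums = [
    sum(predictive_logpdf(p[i], p[1:i-1]) for i in 1:n) for p in permutations(y)
]
@show lml chain_rule_sums mean(chain_rule_sums)   # the mean is exactly (5)
```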

To make the connection with leave-p-out cross-validation apparent, we change the order of summation in (5). We group together summands that are conditioned on the same $y$ entries, regardless of the order in which those entries appear in the conditioning. For instance, $p(y_1 | y_2, y_3)$ and $p(y_1 | y_3, y_2)$ belong to the same group. We call the $y$ entries that a group conditions on a training set, and denote it by $\mathcal{D}_g$ for a group $g$. Furthermore, we sort the summation by the cardinality of $\mathcal{D}_g$.

Example:

Continuing with the previous example, each training set $\mathcal{D}_g$ has cardinality in $\{0,1,2\}$. After grouping and sorting, we obtain the following decomposition:

$$
\begin{aligned}
\log p(y_1, y_2, y_3) &= \frac{1}{3}\bigl( \log p(y_1) + \log p(y_2) + \log p(y_3) \bigr) \\
&\quad + \frac{1}{6}\bigl( \log p(y_3 | y_1) + \log p(y_2 | y_1) + \log p(y_1 | y_2) \\
&\qquad\qquad + \log p(y_3 | y_2) + \log p(y_2 | y_3) + \log p(y_1 | y_3) \bigr) \\
&\quad + \frac{1}{3}\bigl( \log p(y_2 | y_1, y_3) + \log p(y_3 | y_1, y_2) + \log p(y_1 | y_2, y_3) \bigr)
\end{aligned} \tag{6}
$$

Note that each summand

$$ \log p(y_i | \mathcal{D}_g) = \log \int f(y_i | \theta) \, \pi(\theta | \mathcal{D}_g) \, d\theta \tag{7} $$

quantifies the probability of the model generating $y_i$, conditioned on the training data $\mathcal{D}_g$ of its group. Hence, each summand in (5) corresponds to a cross-validation term evaluating a single test point out of the $p = n - |\mathcal{D}_g|$ left-out test points.

Therefore, we can interpret the sum (5) as an exhaustive leave-p-out cross-validation over all possible training sets and all possible left-out points. This insight connects empirical Bayes model selection with traditional cross-validation. See On the marginal likelihood and cross-validation for details.
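
The grouped form can be checked directly. The sketch below, again for the toy conjugate model with the hypothetical `predictive_logpdf` helper (repeated so the snippet runs on its own), enumerates every possible training set $\mathcal{D}_g$, weights each term $\log p(y_i | \mathcal{D}_g)$ by the fraction of permutations that produce it, namely $|\mathcal{D}_g|! \, (n - |\mathcal{D}_g| - 1)! / n!$, and recovers the LML exactly.

```julia
# Exhaustive leave-p-out cross-validation view of (5) for the toy conjugate
# model θ ~ N(0, τ²), yᵢ | θ ~ N(θ, σ²); data values are arbitrary.
using Distributions, LinearAlgebra

τ, σ = 1.0, 0.5
y = [0.3, -0.1, 0.4, 0.8]
n = length(y)

lml = logpdf(MvNormal(zeros(n), fill(τ^2, n, n) + σ^2 * I), y)

# Closed-form posterior predictive log density log p(y_new | D) of the toy model.
function predictive_logpdf(y_new, D)
    s² = 1 / (1 / τ^2 + length(D) / σ^2)
    μ  = s² * sum(D) / σ^2
    logpdf(Normal(μ, sqrt(s² + σ^2)), y_new)
end

# All proper subsets of {1,…,n} as index vectors, via bitmask enumeration.
training_sets = [[j for j in 1:n if (mask >> (j - 1)) & 1 == 1] for mask in 0:(2^n - 2)]

# log p(yᵢ | y_S) shows up in |S|!·(n − |S| − 1)! of the n! permutations.
weight(m) = factorial(m) * factorial(n - m - 1) / factorial(n)

lml_cv = sum(
    weight(length(S)) * predictive_logpdf(y[i], y[S])
    for S in training_sets for i in setdiff(1:n, S)
)
@show lml lml_cv   # should agree up to floating point error
```

The weighting reproduces the coefficients of the three-point example in (6): for $n = 3$, training sets of size 0 and 2 get weight 1/3, and training sets of size 1 get weight 1/6.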

However, even with the interpretation of the LML as an exhaustive cross-validation, the LML is not a suitable proxy for model generalization, as Bayesian Model Selection, the Marginal Likelihood, and Generalization argues. In the decomposition (5), many terms evaluate the predictive performance of the model conditioned on only a few data points. In particular, the $p(y_i)$ terms measure purely the fit of the prior. The issue is that generalization means finding a prior such that, after conditioning on some data, we obtain a posterior with good predictive performance.

On the other hand, quite remarkably, some models admit an analytical formula for the LML, i.e., we can efficiently evaluate an exhaustive leave-p-out cross-validation. For instance, this is the case for Gaussian process regression with Gaussian observation noise.
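
As an illustration of the GP case, here is a minimal sketch assuming a zero-mean GP with a squared-exponential kernel and Gaussian observation noise; the inputs, length scale, and noise level are arbitrary toy values, not taken from any particular library. It computes the closed-form LML and then reproduces it via the exhaustive leave-p-out decomposition, using plain Gaussian conditionals read off the joint covariance.

```julia
# GP with Gaussian noise: closed-form LML vs. exhaustive leave-p-out CV.
# Kernel, length scale, noise level, and data below are arbitrary toy choices.
using Distributions, LinearAlgebra

x = [0.1, 0.4, 0.5, 0.9]                     # inputs
y = [0.3, 0.0, -0.2, 0.5]                    # noisy observations
n = length(y)
σ² = 0.1                                     # observation noise variance
k(a, b) = exp(-(a - b)^2 / (2 * 0.3^2))      # squared-exponential kernel, ℓ = 0.3
C = [k(a, b) for a in x, b in x] + σ² * I    # covariance of the noisy observations

# Closed-form log marginal likelihood of the GP.
lml_gp = logpdf(MvNormal(zeros(n), C), y)

# Gaussian conditional log p(yᵢ | y_S), read off the joint covariance C.
function conditional_logpdf(i, S)
    isempty(S) && return logpdf(Normal(0, sqrt(C[i, i])), y[i])
    μ = (C[i:i, S] * (C[S, S] \ y[S]))[1]
    v = C[i, i] - (C[i:i, S] * (C[S, S] \ C[S, i:i]))[1]
    logpdf(Normal(μ, sqrt(v)), y[i])
end

# Exhaustive leave-p-out CV with the same permutation-counting weights as above.
training_sets = [[j for j in 1:n if (mask >> (j - 1)) & 1 == 1] for mask in 0:(2^n - 2)]
weight(m) = factorial(m) * factorial(n - m - 1) / factorial(n)

lml_cv = sum(
    weight(length(S)) * conditional_logpdf(i, S)
    for S in training_sets for i in setdiff(1:n, S)
)
@show lml_gp lml_cv   # should agree up to floating point error
```

Of course, the point of the closed form is that one does not need the exponential-time enumeration; the snippet only makes the equivalence tangible on a tiny dataset.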

Thanks for reading!

References

E. Fong and C. C. Holmes. On the marginal likelihood and cross-validation. Biometrika, 2020.

S. Lotfi, P. Izmailov, G. Benton, M. Goldblum, and A. G. Wilson. Bayesian Model Selection, the Marginal Likelihood, and Generalization. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.

CC BY-SA 4.0 Samuel Belko. Last modified: May 09, 2026. Website built with Franklin.jl and the Julia programming language.