MET-SVGD in a Nutshell

Problem Setup

We consider a target density known only up to normalization

$$p(x) = \frac{\bar{p}(x)}{Z},$$

where $Z$ is intractable. Our objective is to approximate $p$ and estimate functionals of $p$, particularly its entropy $\mathcal{H}(p)$.

Problem Significance

Target distributions known only up to a normalization constant arise throughout machine learning. A prominent example is Maximum-Entropy Reinforcement Learning (MaxEnt RL), which augments the standard reinforcement learning objective

$$J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} r(s_t, a_t)\right]$$

with an entropy regularization term,

$$J_{\mathrm{MaxEnt}}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T}\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\right],$$

where $\mathcal{H}(\pi(\cdot\mid s))$ denotes the policy entropy and $\alpha$ controls the reward-entropy trade-off.

Without the entropy term, the optimal solution is a deterministic policy that selects a single highest-return action in each state. In contrast, maximizing the entropy-regularized objective yields a stochastic policy that assigns probability mass to multiple high-return actions. In fact, the optimal MaxEnt policy takes the form

$$\pi(a\mid s) = \frac{\exp(Q(s,a)/\alpha)}{Z},$$

which is an energy-based distribution over the Q-values with an intractable normalization constant.
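For a finite action set the normalization constant is just a sum, so the energy-based form above can be evaluated directly. The sketch below is only an illustration under that assumption: the Q-values, the function names `maxent_policy` and `entropy`, and the two temperatures are hypothetical and not taken from the article.

```python
import numpy as np

def maxent_policy(q_values, alpha):
    """Return pi(a|s) proportional to exp(Q(s,a)/alpha) for a finite action set."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits -= logits.max()          # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()  # the sum plays the role of Z

def entropy(probs):
    """Shannon entropy in nats, ignoring zero-probability actions."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

# Hypothetical Q-values for three actions in a single state
q = [1.0, 0.9, -2.0]
pi_cold = maxent_policy(q, alpha=0.05)  # small alpha: close to greedy
pi_warm = maxent_policy(q, alpha=1.0)   # larger alpha: mass on both good actions
print(np.round(pi_cold, 3), round(entropy(pi_cold), 3))
print(np.round(pi_warm, 3), round(entropy(pi_warm), 3))
```

With a small $\alpha$ the policy concentrates on the single best action, while a larger $\alpha$ spreads probability over the two near-optimal actions. In continuous action spaces the sum becomes an integral, which is where the intractability arises.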

The resulting stochasticity provides a significant robustness advantage. Rather than committing to a single trajectory, the agent learns a distribution over multiple high-reward behaviors. Consequently, if the environment changes at test time, the policy can exploit alternative strategies that were also assigned non-negligible probability during training.

Figure 1: A robot navigating a maze. (a) Environment at train time (no obstacles). (b) Environment at test time. Ref: https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/.

Figure 1 illustrates this effect. When the environment changes at test time, a deterministic policy is likely to fail because it has committed to a single behavior. In contrast, a maximum-entropy policy can adapt by exploiting an alternative route that was already represented in its action distribution.

This robustness comes at a cost: evaluating the MaxEnt objective requires estimating the entropy

$$\mathcal{H}(\pi) = -\mathbb{E}_{a \sim \pi}[\log \pi(a)],$$

yet $\pi$ is only available through an unnormalized density.
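To make the dependence on the normalization constant explicit, one can write $\pi(a) = \bar{\pi}(a)/Z$, where $\bar{\pi}$ is a notational assumption introduced here to mirror $\bar{p}$ from the problem setup. Substituting into the entropy gives:

$$\mathcal{H}(\pi) = -\mathbb{E}_{a \sim \pi}\big[\log \bar{\pi}(a) - \log Z\big] = -\mathbb{E}_{a \sim \pi}\big[\log \bar{\pi}(a)\big] + \log Z.$$

For the MaxEnt policy, $\log \bar{\pi}(a) = Q(s,a)/\alpha$, so the first term could be estimated from samples, but the entropy still requires $\log Z$.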

The Challenge

The primary challenge is that the normalization constant $Z$ is unknown, making direct evaluation of

$$p(x) = \frac{\bar{p}(x)}{Z}$$

intractable. While $Z$ can be computed analytically for simple distributions such as Gaussians, it is generally unavailable for the complex, high-dimensional distributions encountered in modern machine learning.
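As a small illustration of the Gaussian case, the sketch below compares the closed-form normalizer of a one-dimensional Gaussian with a brute-force grid estimate, then counts how many evaluations the same grid would need in higher dimensions. The value of `sigma`, the grid size, and the function name are assumptions chosen for the example.

```python
import numpy as np

def unnormalized_gaussian(x, sigma):
    """p_bar(x) = exp(-x^2 / (2 sigma^2)), a Gaussian without its normalizer."""
    return np.exp(-0.5 * (x / sigma) ** 2)

sigma = 1.5
z_exact = sigma * np.sqrt(2.0 * np.pi)  # closed form for the 1-D Gaussian

# Riemann-sum estimate of Z on a uniform grid covering +/- 10 sigma
grid = np.linspace(-10 * sigma, 10 * sigma, 2001)
dx = grid[1] - grid[0]
z_grid = float(unnormalized_gaussian(grid, sigma).sum() * dx)

# The same per-axis resolution in d dimensions needs 2001**d evaluations
points_needed = {d: 2001 ** d for d in (1, 2, 10)}

print(z_exact, z_grid)
print(points_needed)
```

The grid estimate works in one dimension, but the number of evaluations grows exponentially with dimension, which is why direct integration is not an option for the distributions considered here.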

Existing approaches fall into two broad categories:

Ideally, we seek a method that constructs an approximation $q$ that:

MET-SVGD

Metropolis-Hastings Stein Variational Gradient Descent (MET-SVGD) addresses the above challenges by deriving a tractable density representation for Stein Variational Gradient Descent (SVGD; Liu & Wang, 2016). This enables direct entropy estimation while retaining SVGD’s ability to draw samples from complex target distributions.
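For readers unfamiliar with the underlying sampler, here is a minimal sketch of a plain SVGD particle update in the style of Liu & Wang (2016), not of MET-SVGD itself. The fixed RBF bandwidth, the fixed step size, the Gaussian toy target, and all function names are assumptions made for this illustration. Only the score $\nabla \log \bar{p}$ is needed, so the normalization constant never appears.

```python
import numpy as np

def rbf_kernel(x, bandwidth):
    """Return K[j, i] = k(x_j, x_i) and grad_K[j, i] = d k(x_j, x_i) / d x_j."""
    diffs = x[:, None, :] - x[None, :, :]          # diffs[j, i] = x_j - x_i
    sq_dists = (diffs ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    grad_K = -diffs * K[:, :, None] / bandwidth ** 2
    return K, grad_K

def svgd_step(x, score_fn, step_size=0.1, bandwidth=1.0):
    """One SVGD update: x_i <- x_i + step_size * phi(x_i)."""
    n = x.shape[0]
    K, grad_K = rbf_kernel(x, bandwidth)
    scores = score_fn(x)                           # shape (n, d)
    # phi(x_i) = (1/n) sum_j [ k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ scores + grad_K.sum(axis=0)) / n
    return x + step_size * phi

# Toy target (assumed): isotropic Gaussian centered at mu, known only via its score
mu = np.array([2.0, -1.0])
score = lambda x: -(x - mu)

rng = np.random.default_rng(0)
particles = rng.normal(size=(50, 2))
for _ in range(500):
    particles = svgd_step(particles, score)
print(particles.mean(axis=0))
```

The first term in `phi` drives particles toward high-density regions and the second term repels them from one another, which is what lets SVGD represent more than a single mode.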

More broadly, MET-SVGD unifies three major paradigms for approximate inference: Stein Variational Gradient Descent (SVGD), parametric variational inference (P-VI), and Metropolis-Hastings (MH). As a result, it inherits key advantages from each:

A key contribution of MET-SVGD is that it enables SVGD to scale to high-dimensional distributions while preserving computational efficiency and multimodal expressivity.

Furthermore, MET-SVGD replaces expensive hyperparameter search with end-to-end optimization of sampler parameters through KL-divergence minimization, enabling automatic adaptation of the inference procedure to the target distribution.
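The following identity suggests why a tractable entropy makes such end-to-end optimization possible. It assumes the reverse KL direction and a sampler distribution $q_\theta$ with parameters $\theta$; both the direction and the notation are assumptions, since the text above does not specify them:

$$\mathrm{KL}(q_\theta \,\|\, p) = \mathbb{E}_{x \sim q_\theta}\big[\log q_\theta(x) - \log p(x)\big] = -\mathcal{H}(q_\theta) - \mathbb{E}_{x \sim q_\theta}\big[\log \bar{p}(x)\big] + \log Z.$$

Since $\log Z$ does not depend on $\theta$, minimizing this objective only requires the entropy of $q_\theta$ and evaluations of the unnormalized density $\bar{p}$.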

Finally, MET-SVGD can be interpreted as a full-rank normalizing flow whose layers are induced by successive SVGD updates. Unlike conventional flows with a fixed depth, the number of transformations is determined adaptively through a convergence criterion, yielding a flexible and highly expressive density model.
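The flow interpretation can be sketched with the standard change-of-variables formula. Assume each layer is an SVGD-style update $x_{l+1} = x_l + \epsilon\,\phi(x_l)$ with step size $\epsilon$ and update direction $\phi$, and that this map is invertible. The symbols $q_l$, $\epsilon$, and $\phi$ are notation introduced here, and the exact form used by MET-SVGD may differ:

$$\log q_{l+1}(x_{l+1}) = \log q_l(x_l) - \log\left|\det\big(I + \epsilon\,\nabla_{x_l}\phi(x_l)\big)\right|.$$

Taking the negative expectation on both sides and summing over $L$ layers gives

$$\mathcal{H}(q_L) = \mathcal{H}(q_0) + \sum_{l=0}^{L-1} \mathbb{E}_{x_l \sim q_l}\Big[\log\left|\det\big(I + \epsilon\,\nabla_{x_l}\phi(x_l)\big)\right|\Big],$$

so the entropy of the final distribution follows from the entropy of a simple initial distribution $q_0$ plus per-layer log-determinant terms.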

Figure 2: MET-SVGD bridges the gap between P-VI, SVGD, and MCMC methods.

Table 1: MET-SVGD inherits the advantages of different approximate inference methods

P-VI    MCMC    SVGD    P-SVGD    MET-SVGD
Expressivity✓✓
Convergence Detection
Convergence Guarantees
Sampling Efficiency
Tractable Entropy
Parameter Efficiency✓✓✓✓
References
  1. Liu, Q., & Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. NeurIPS.