
I was following this nice demo on updating priors from posteriors. It provides the following function to reconstruct a posterior from its samples:

import numpy as np
import pymc3 as pm
from pymc3 import Interpolated
from scipy import stats

def from_posterior(param, samples):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    kde = stats.gaussian_kde(samples)
    y = kde(x)

    # what was never sampled should have a small probability but not 0,
    # so we'll extend the domain and use linear approximation of density on it
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return Interpolated(param, x, y)
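
For a single scalar parameter this works nicely; the demo then rebuilds each prior from the previous trace roughly like this (a sketch, assuming trace comes from an earlier pm.sample run):

with pm.Model() as model:
    # replace the original prior with the interpolated posterior from the previous round
    alpha = from_posterior("alpha", trace["alpha"])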

However, I'm trying to reconstruct a multidimensional distribution: not alpha = Normal("alpha", mu=0, sigma=1) but alpha = Normal("alpha", mu=0, sigma=1, size=100). The above function then no longer works, and I tried to extend it in two ways.

First I tried the neat, vectorized way, but I was not able to get the data into the right format for gaussian_kde.

samples = trace["alpha"]
print("samples", samples.shape)      # (n_draws, 100)
smin, smax = samples.min(axis=0), samples.max(axis=0)
width = smax - smin
print("width", width.shape)          # (100,)
x = np.linspace(smin, smax, 98, axis=1)
print("x", x.shape)                  # (100, 98): one grid of 98 points per dimension
kde = stats.gaussian_kde(samples.T)  # builds a single joint 100-dimensional KDE
y = kde.evaluate(x)
print("y", y.shape)                  # (98,): one joint density per column, not 100 marginals

Then I tried the loopy way, but this fails as well because the concatenated x is no longer increasing everywhere, which Interpolated rejects.

def from_posterior(samples):
    # one-dimensional KDE for a single component of alpha
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 98)
    kde = stats.gaussian_kde(samples)
    y = kde(x)
    # what was never sampled should have a small probability but not 0,
    # so we'll extend the domain and use linear approximation of density on it
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return x, y

def from_posterior_in_loop(param, samples):
    all_x, all_y = [], []
    for s in range(samples.shape[1]):
        x, y = from_posterior(samples[:, s])
        all_x.append(x)
        all_y.append(y)
    # concatenating per-component grids makes x non-monotonic
    return Interpolated(param, np.concatenate(all_x), np.concatenate(all_y))

with pm.Model() as model:
    alpha = from_posterior_in_loop("alpha", trace["alpha"])
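
Since Interpolated is univariate, one workaround I can imagine (just a sketch, assuming the PyMC3 API; the per-component variable names are made up) is a separate Interpolated variable per component:

with pm.Model() as model:
    # one univariate Interpolated per component, named alpha_0 ... alpha_99
    alpha = [
        Interpolated(f"alpha_{i}", *from_posterior(trace["alpha"][:, i]))
        for i in range(trace["alpha"].shape[1])
    ]

This gives a list of scalar variables rather than one vector-valued alpha, though, so downstream code would need to stack them (e.g. with pm.math.stack).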

Ideas are welcome. I don't necessarily need to reconstruct the posterior from a trace; is there perhaps another method to iteratively fit a big dataset?

Roelant
  • I've [commented previously](https://stackoverflow.com/a/53367519/570918) on being skeptical of KDE-based updating, instead advocating that if a conjugate model is possible, it should be preferred. In higher dimensions this is further aggravated by naively requiring a number of approximation parameters that grows exponentially with dimensions. Of course, structured interpolations are possible, but I'm skeptical that a custom updating function with something like a multi-dimensional Gaussian would do any better than variational inference with minibatches (which is already implemented). – merv Feb 21 '22 at 07:23
  • Here's [the tutorial using the VI API](https://docs.pymc.io/en/v3/pymc-examples/examples/variational_inference/variational_api_quickstart.html), which includes minibatches for large datasets at the end (sketched below these comments). – merv Feb 21 '22 at 07:30
  • Ah, that's exactly what I need - thanks a lot (if you want I would accept it as an answer). Could you explain the remark "Multidimensional minibatches may be needed for some corner cases where you do matrix factorization or model is very wide" and do you have an example for this? – Roelant Feb 22 '22 at 08:17
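
For reference, the minibatch pattern from the linked tutorial looks roughly like this (a sketch only, assuming the PyMC3 API; X_data, y_data, and the regression structure are made-up placeholders):

# stream random minibatches of a large dataset instead of the full arrays
X_batch = pm.Minibatch(X_data, batch_size=128)
y_batch = pm.Minibatch(y_data, batch_size=128)

with pm.Model() as model:
    alpha = pm.Normal("alpha", mu=0, sigma=1, shape=100)
    mu = pm.math.dot(X_batch, alpha)
    # total_size rescales the minibatch likelihood to the size of the full dataset
    obs = pm.Normal("obs", mu=mu, sigma=1, observed=y_batch, total_size=len(X_data))
    approx = pm.fit(n=10000)  # ADVI by default
    trace = approx.sample(2000)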

0 Answers