# Multivariate Bijections

## Background​

The fundamental idea of normalizing flows also applies to multivariate random variables, and this is where its value is clearly seen - representing complex high-dimensional distributions. In this case, a simple multivariate source of noise, for example a standard i.i.d. normal distribution, $X\sim\mathcal{N}(\mathbf{0},I_{D\times D})$, is passed through a vector-valued bijection, $g:\mathbb{R}^D\rightarrow\mathbb{R}^D$, to produce the more complex transformed variable $Y=g(X)$.

Sampling $Y$ is again trivial and involves evaluation of the forward pass of $g$. We can score $Y$ using the multivariate substitution rule of integral calculus,

\begin{aligned} \mathbb{E}_{p_X(\cdot)}\left[f(X)\right] &= \int_{\text{supp}(X)}f(\mathbf{x})p_X(\mathbf{x})d\mathbf{x}\\ &= \int_{\text{supp}(Y)}f(g^{-1}(\mathbf{y}))p_X(g^{-1}(\mathbf{y}))\det\left|\frac{d\mathbf{x}}{d\mathbf{y}}\right|d\mathbf{y}\\ &= \mathbb{E}_{p_Y(\cdot)}\left[f(g^{-1}(Y))\right], \end{aligned}

where $d\mathbf{x}/d\mathbf{y}$ denotes the Jacobian matrix of $g^{-1}(\mathbf{y})$. Equating the last two lines we get,

\begin{aligned} \log(p_Y(y)) &= \log(p_X(g^{-1}(y)))+\log\left(\det\left|\frac{d\mathbf{x}}{d\mathbf{y}}\right|\right)\\ &= \log(p_X(g^{-1}(y)))-\log\left(\det\left|\frac{d\mathbf{y}}{d\mathbf{x}}\right|\right). \end{aligned}

Inituitively, this equation says that the density of $Y$ is equal to the density at the corresponding point in $X$ plus a term that corrects for the warp in volume around an infinitesimally small volume around $Y$ caused by the transformation. For instance, in $2$-dimensions, the geometric interpretation of the absolute value of the determinant of a Jacobian is that it represents the area of a parallelogram with edges defined by the columns of the Jacobian. In $n$-dimensions, the geometric interpretation of the absolute value of the determinant Jacobian is that is represents the hyper-volume of a parallelepiped with $n$ edges defined by the columns of the Jacobian (see a calculus reference such as [7] for more details).

Similar to the univariate case, we can compose such bijective transformations to produce even more complex distributions. By an inductive argument, if we have $L$ transforms $g_{(0)}, g_{(1)},\ldots,g_{(L-1)}$, then the log-density of the transformed variable $Y=(g_{(0)}\circ g_{(1)}\circ\cdots\circ g_{(L-1)})(X)$ is

\begin{aligned} \log(p_Y(y)) &= \log\left(p_X\left(\left(g_{(L-1)}^{-1}\circ\cdots\circ g_{(0)}^{-1}\right)\left(y\right)\right)\right)+\sum^{L-1}_{l=0}\log\left(\left|\frac{dg^{-1}_{(l)}(y_{(l)})}{dy'}\right|\right), \end{aligned}

where we've defined $y_{(0)}=x$, $y_{(L-1)}=y$ for convenience of notation.

The main challenge is in designing parametrizable multivariate bijections that have closed form expressions for both $g$ and $g^{-1}$, a tractable Jacobian whose calculation scales with $O(D)$ or $O(1)$ rather than $O(D^3)$, and can express a flexible class of functions.

## Multivariate Bijectors​

In this section, we show how to use bij.SplineAutoregressive to learn the bivariate toy distribution from our running example. Making a simple change we can represent bivariate distributions of the form, $p(x_1,x_2)=p(x_1)p(x_2|x_1)$:

dist_x = torch.distributions.Independent(  torch.distributions.Normal(torch.zeros(2), torch.ones(2)),   1)bijector = bij.SplineAutoregressive()dist_y = dist.Flow(dist_x, bijector)

The bij.SplineAutoregressive bijector extends bij.Spline so that the spline parameters are the output of an autoregressive neural network. See [durkan2019neural] and [germain2015made] for more details.

Similarly to before, we train this distribution on the toy dataset and plot the results:

dataset = torch.tensor(X, dtype=torch.float)optimizer = torch.optim.Adam(dist_y.parameters(), lr=5e-3)for step in range(steps):    optimizer.zero_grad()    loss = -dist_y.log_prob(dataset).mean()    loss.backward()    optimizer.step()        if step % 500 == 0:        print('step: {}, loss: {}'.format(step, loss.item()))
step: 0, loss: 8.446191787719727step: 500, loss: 2.0197808742523193step: 1000, loss: 1.794958472251892step: 1500, loss: 1.73616361618042step: 2000, loss: 1.7254879474639893step: 2500, loss: 1.691617488861084step: 3000, loss: 1.679549217224121step: 3500, loss: 1.6967085599899292step: 4000, loss: 1.6723777055740356step: 4500, loss: 1.6505967378616333step: 5000, loss: 1.8024061918258667
X_flow = dist_y.sample(torch.Size([1000,])).detach().numpy()plt.title(r'Joint Distribution')plt.xlabel(r'$x_1$')plt.ylabel(r'$x_2$')plt.scatter(X[:,0], X[:,1], label='data', alpha=0.5)plt.scatter(X_flow[:,0], X_flow[:,1], color='firebrick', label='flow', alpha=0.5)plt.legend()plt.show()

plt.subplot(1, 2, 1)sns.distplot(X[:,0], hist=False, kde=True,              bins=None,             hist_kws={'edgecolor':'black'},             kde_kws={'linewidth': 2},             label='data')sns.distplot(X_flow[:,0], hist=False, kde=True,              bins=None, color='firebrick',             hist_kws={'edgecolor':'black'},             kde_kws={'linewidth': 2},             label='flow')plt.title(r'$p(x_1)$')plt.subplot(1, 2, 2)sns.distplot(X[:,1], hist=False, kde=True,              bins=None,             hist_kws={'edgecolor':'black'},             kde_kws={'linewidth': 2},             label='data')sns.distplot(X_flow[:,1], hist=False, kde=True,              bins=None, color='firebrick',             hist_kws={'edgecolor':'black'},             kde_kws={'linewidth': 2},             label='flow')plt.title(r'$p(x_2)$')plt.show()

We see from the output that this normalizing flow has successfully learnt both the univariate marginals and the bivariate distribution.