# Multivariate Bijections

## Background

The fundamental idea of normalizing flows also applies to multivariate random variables, and this is where their value becomes clear: representing complex high-dimensional distributions. In this case, a simple multivariate source of noise, for example a standard i.i.d. normal distribution, $X\sim\mathcal{N}(\mathbf{0},I_{D\times D})$, is passed through a vector-valued bijection, $g:\mathbb{R}^D\rightarrow\mathbb{R}^D$, to produce the more complex transformed variable $Y=g(X)$.

Sampling $Y$ is again trivial and involves evaluation of the forward pass of $g$. We can score $Y$ using the multivariate substitution rule of integral calculus,

\begin{aligned} \mathbb{E}_{p_X(\cdot)}\left[f(X)\right] &= \int_{\text{supp}(X)}f(\mathbf{x})p_X(\mathbf{x})d\mathbf{x}\\ &= \int_{\text{supp}(Y)}f(g^{-1}(\mathbf{y}))p_X(g^{-1}(\mathbf{y}))\left|\det\frac{d\mathbf{x}}{d\mathbf{y}}\right|d\mathbf{y}\\ &= \mathbb{E}_{p_Y(\cdot)}\left[f(g^{-1}(Y))\right], \end{aligned}

where $d\mathbf{x}/d\mathbf{y}$ denotes the Jacobian matrix of $g^{-1}(\mathbf{y})$. Equating the last two lines we get,

\begin{aligned} \log(p_Y(\mathbf{y})) &= \log(p_X(g^{-1}(\mathbf{y})))+\log\left(\left|\det\frac{d\mathbf{x}}{d\mathbf{y}}\right|\right)\\ &= \log(p_X(g^{-1}(\mathbf{y})))-\log\left(\left|\det\frac{d\mathbf{y}}{d\mathbf{x}}\right|\right). \end{aligned}

Intuitively, this equation says that the density of $Y$ is equal to the density at the corresponding point in $X$ plus a term that corrects for the change in volume of an infinitesimally small region around $Y$ caused by the transformation. For instance, in $2$ dimensions, the geometric interpretation of the absolute value of the determinant of a Jacobian is that it represents the area of a parallelogram with edges defined by the columns of the Jacobian. In $n$ dimensions, the geometric interpretation is that it represents the hyper-volume of a parallelepiped with $n$ edges defined by the columns of the Jacobian (see a standard multivariate calculus reference for more details).
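To make this concrete, here is a minimal sketch (plain PyTorch, not the `bij`/`dist` API used later) that checks the change-of-variables formula on a linear bijection $g(\mathbf{x})=A\mathbf{x}+\mathbf{b}$, whose Jacobian is simply the constant matrix $A$; the matrix and evaluation point are arbitrary choices for illustration:

```python
import torch

# Verify log p_Y(y) = log p_X(g^{-1}(y)) - log|det dy/dx|
# for the linear bijection g(x) = A x + b applied to a standard 2D normal.
A = torch.tensor([[2.0, 0.5], [0.0, 1.5]])
b = torch.tensor([1.0, -1.0])
p_x = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

y = torch.tensor([0.3, 0.7])
x = torch.linalg.solve(A, y - b)              # g^{-1}(y)
log_p_y = p_x.log_prob(x) - torch.logdet(A)   # change-of-variables formula

# Reference: Y = A X + b is itself Gaussian with mean b and covariance A A^T.
p_y = torch.distributions.MultivariateNormal(b, A @ A.T)
print(torch.allclose(log_p_y, p_y.log_prob(y)))   # True, up to numerical error
```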

Similar to the univariate case, we can compose such bijective transformations to produce even more complex distributions. By an inductive argument, if we have $L$ transforms $g_{(0)}, g_{(1)},\ldots,g_{(L-1)}$, then the log-density of the transformed variable $Y=(g_{(0)}\circ g_{(1)}\circ\cdots\circ g_{(L-1)})(X)$ is

\begin{aligned} \log(p_Y(y)) &= \log\left(p_X\left(\left(g_{(L-1)}^{-1}\circ\cdots\circ g_{(0)}^{-1}\right)\left(y\right)\right)\right)+\sum^{L-1}_{l=0}\log\left(\left|\det\frac{dg^{-1}_{(l)}(y_{(l)})}{dy'}\right|\right), \end{aligned}

where, for convenience of notation, we've defined $y_{(l)}=\left(g_{(l)}\circ\cdots\circ g_{(L-1)}\right)(x)$, so that $y_{(0)}=y$ and the Jacobian of each $g^{-1}_{(l)}$ is evaluated at the output of the corresponding bijection $g_{(l)}$.
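As a sanity check on the composed formula, the following sketch (again plain `torch.distributions` rather than the library API used below) accumulates one log-absolute-determinant term per bijection for a chain of two elementwise transforms and compares the result with PyTorch's built-in change of variables:

```python
import torch

# Two elementwise bijections: g_(1)(x) = exp(x) is applied first,
# g_(0)(z) = 2 z + 1 last, so y = (g_(0) o g_(1))(x).
p_x = torch.distributions.Independent(
    torch.distributions.Normal(torch.zeros(2), torch.ones(2)), 1)

def log_prob_y(y):
    z = (y - 1.0) / 2.0   # g_(0)^{-1}(y)
    x = torch.log(z)      # g_(1)^{-1}(z)
    # One log|det| term per bijection; both Jacobians are diagonal here.
    log_det_g0 = y.shape[-1] * torch.log(torch.tensor(2.0))   # d(2z+1)/dz = 2 per dim
    log_det_g1 = x.sum(-1)                                    # d(exp x)/dx = exp(x)
    return p_x.log_prob(x) - log_det_g1 - log_det_g0

# Reference: torch applies the listed transforms in order (ExpTransform first).
p_y = torch.distributions.TransformedDistribution(
    p_x, [torch.distributions.ExpTransform(),
          torch.distributions.AffineTransform(loc=1.0, scale=2.0)])
y = p_y.sample(torch.Size([5]))
print(torch.allclose(log_prob_y(y), p_y.log_prob(y), atol=1e-5))
```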

The main challenge is in designing parametrizable multivariate bijections that have closed-form expressions for both $g$ and $g^{-1}$, whose Jacobian determinant can be computed in $O(D)$ or $O(1)$ time rather than the $O(D^3)$ of a general determinant, and that can express a flexible class of functions. Autoregressive designs, like the bijector used in the next section, are one way to satisfy these criteria: when each output dimension depends only on the preceding input dimensions, the Jacobian is triangular, so its log-determinant is just a sum along the diagonal.
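The toy sketch below (plain PyTorch, with a hypothetical stand-in for the autoregressive network) illustrates this for an affine-autoregressive map:

```python
import torch

# Affine-autoregressive map y_i = x_i * exp(s_i) + m_i, where (s_i, m_i)
# may depend only on x_1..x_{i-1}. The Jacobian dy/dx is lower-triangular,
# so log|det| is the sum of the log-scales s_i: an O(D) computation rather
# than an O(D^3) determinant.
D = 4
x = torch.randn(D)

def conditioner(x, i):
    """Stand-in for an autoregressive network (e.g. MADE): uses x[:i] only."""
    h = x[:i].sum()
    return torch.tanh(h), 0.5 * h   # (log-scale s_i, shift m_i)

y = torch.empty(D)
log_abs_det = torch.tensor(0.0)
for i in range(D):
    s_i, m_i = conditioner(x, i)
    y[i] = x[i] * torch.exp(s_i) + m_i
    log_abs_det = log_abs_det + s_i   # log of the i-th diagonal Jacobian entry

print(y, log_abs_det)
```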

## Multivariate Bijectors

In this section, we show how to use `bij.SplineAutoregressive` to learn the bivariate toy distribution from our running example. With a simple change to the previous code, we can represent bivariate distributions of the form $p(x_1,x_2)=p(x_1)p(x_2\mid x_1)$:

```python
# Bivariate standard normal base distribution, with the last dimension
# reinterpreted as the event dimension.
dist_x = torch.distributions.Independent(
    torch.distributions.Normal(torch.zeros(2), torch.ones(2)),
    1
)

# Autoregressive spline bijector and the resulting flow distribution.
bijector = bij.SplineAutoregressive()
dist_y = dist.Flow(dist_x, bijector)
```
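Even before training, `dist_y` behaves like any other distribution object; we can draw samples and score them with the same `sample` and `log_prob` methods used in the training and plotting code below:

```python
# Quick sanity check on the untrained flow.
y = dist_y.sample(torch.Size([5]))
print(y.shape)              # expected: torch.Size([5, 2])
print(dist_y.log_prob(y))   # one log-density per sample
```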

The `bij.SplineAutoregressive` bijector extends `bij.Spline` so that the spline parameters are the output of an autoregressive neural network. See [durkan2019neural] and [germain2015made] for more details.
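To give a flavour of the autoregressive conditioning, here is a toy sketch of MADE-style masking (an illustration, not FlowTorch's internals): a strictly lower-triangular mask on a linear layer's weights guarantees that output $i$ depends only on the inputs before it:

```python
import torch

# MADE-style masking [germain2015made]: masking the weights of a linear layer
# with a strictly lower-triangular matrix makes the layer autoregressive,
# i.e. output i is a function of inputs 0..i-1 only.
D = 3
mask = torch.tril(torch.ones(D, D), diagonal=-1)
weight = torch.randn(D, D)
x = torch.randn(D)

params = (mask * weight) @ x   # params[i] depends on x[:i] only
print(params)
```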

As before, we train this distribution on the toy dataset and plot the results:

```python
dataset = torch.tensor(X, dtype=torch.float)
optimizer = torch.optim.Adam(dist_y.parameters(), lr=5e-3)
steps = 5001  # assumed value, inferred from the printed output below

for step in range(steps):
    optimizer.zero_grad()
    loss = -dist_y.log_prob(dataset).mean()
    loss.backward()
    optimizer.step()

    if step % 500 == 0:
        print('step: {}, loss: {}'.format(step, loss.item()))
```
```
step: 0, loss: 8.446191787719727
step: 500, loss: 2.0197808742523193
step: 1000, loss: 1.794958472251892
step: 1500, loss: 1.73616361618042
step: 2000, loss: 1.7254879474639893
step: 2500, loss: 1.691617488861084
step: 3000, loss: 1.679549217224121
step: 3500, loss: 1.6967085599899292
step: 4000, loss: 1.6723777055740356
step: 4500, loss: 1.6505967378616333
step: 5000, loss: 1.8024061918258667
```
```python
# Draw samples from the trained flow and compare them with the data.
X_flow = dist_y.sample(torch.Size([1000,])).detach().numpy()

# Joint distribution: data vs. flow samples.
plt.title(r'Joint Distribution')
plt.xlabel(r'$x_1$')
plt.ylabel(r'$x_2$')
plt.scatter(X[:,0], X[:,1], label='data', alpha=0.5)
plt.scatter(X_flow[:,0], X_flow[:,1], color='firebrick', label='flow', alpha=0.5)
plt.legend()
plt.show()

# Marginal distributions p(x_1) and p(x_2).
plt.subplot(1, 2, 1)
sns.distplot(X[:,0], hist=False, kde=True,
             bins=None,
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2},
             label='data')
sns.distplot(X_flow[:,0], hist=False, kde=True,
             bins=None, color='firebrick',
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2},
             label='flow')
plt.title(r'$p(x_1)$')

plt.subplot(1, 2, 2)
sns.distplot(X[:,1], hist=False, kde=True,
             bins=None,
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2},
             label='data')
sns.distplot(X_flow[:,1], hist=False, kde=True,
             bins=None, color='firebrick',
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 2},
             label='flow')
plt.title(r'$p(x_2)$')
plt.show()
```

We see from the output that this normalizing flow has successfully learnt both the univariate marginals and the bivariate joint distribution.