Let's begin training our first Normalizing Flow with a simple example! The target distribution that we intend to learn is,

\begin{aligned} Y' &\sim \mathcal{N}\left(\mu=\begin{bmatrix} 5 \\ 5 \end{bmatrix}, \Sigma=\begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix} \right) \end{aligned},

that is, a linear transformation of an standard multivariate normal distribution. The base distribution is,

\begin{aligned} X &\sim \mathcal{N}\left(\mu=\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \Sigma=\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \end{aligned},

that is, standard normal noise (which is typical for Normalizing Flows). The task is to learn some bijection $g_\theta$ so that

\begin{aligned} Y &\triangleq g_\theta(X) \\ &\sim Y' \end{aligned}

approximately holds. We will define our Normalizing Flow, $g_\theta$ by a single affine transformation,

\begin{aligned} g_\theta(\mathbf{x}) &\triangleq \begin{bmatrix} \mu_1 \\ \mu_2(x_1) \end{bmatrix} + \begin{bmatrix} \sigma_1 \\ \sigma_2(x_1) \end{bmatrix}\otimes\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. \end{aligned}

In this notation, $\mathbf{x}=(x_1,x_2)^T$, $\otimes$ denotes element-wise multiplication, and the parameters are the scalars $\mu_1,\sigma_1$ and the parameters of the neural networks $\mu_2(\cdot)$ and $\sigma_2(\cdot)$. (Think of the NNs as very simple shallow feedforward nets in this example.) This is an example of Inverse Autoregressive Flow.

There are several metrics we could use to train $Y$ to be close in distribution to $Y'$. First, let us denote the target distribution of $Y'$ by $p(\cdot)$ and the learnable distribution of the normalizing flow, $Y$, as $q_\theta(\cdot)$ (in the following sections, we will explain how to calculate $q_\theta$ from $g_\theta$). Let's use the forward KL-divergence,

\begin{aligned} \text{KL}\{p\ ||\ q_\theta\} &\triangleq \mathbb{E}_{p(\cdot)}\left[\log\frac{p(Y')}{q_\theta(Y')}\right] \\ &= -\mathbb{E}_{p(\cdot)}\left[\log q_\theta(Y')\right] + C, \end{aligned}

where C is a constant that does not depend on $\theta$. In practice, we draw a finite sample, $\{y_1,\ldots,y_M\}$, from $p$ and optimize a Monte Carlo estimate of the KL-divergence with stochastic gradient descent so that the loss is,

\begin{aligned} \mathcal{L}(\theta) &= -\frac{1}{M}\sum^M_{m=1}\log(q_\theta(y_m)) \end{aligned}

So, to summarize, the task at hand is to learn how to transform standard bivariate normal noise into another bivariate normal distribution using an affine transformation, and we will do so by matching distributions with the KL-divergence metric.

## Implementation​

First, we import the relevant libraries:

import torchimport flowtorch.bijectors as bijimport flowtorch.distributions as dist

The base and target distributions are defined using standard PyTorch:

base_dist = torch.distributions.Independent(  torch.distributions.Normal(torch.zeros(2), torch.ones(2)),   1)target_dist = torch.distributions.Independent(  torch.distributions.Normal(torch.zeros(2)+5, torch.ones(2)*0.5),  1)

Note the use of torch.distributions.Independent so that our base and target distributions are vector valued.

We can visualize samples from the base and target: A Normalizing Flow is created in two steps. First, we create a "plan" for the flow as a flowtorch.bijectors.Bijector object,

# Lazily instantiated flowbijectors = bij.AffineAutoregressive()

This plan is then made concrete by combining it with the base distributions, which provides the input shape, and constructing a flowtorch.distributions.Flow object, and extension of torch.distributions.Distribution:

# Instantiate transformed distribution and parametersflow = dist.Flow(base_dist, bijectors)

At this point, we have an object, flow, for the distribution, $q_\theta(\cdot)$, that follows the standard PyTorch interface. Therefore, it can be trained with the following code, which will be familiar for readers who have used torch.distributions before:

# Training loopopt = torch.optim.Adam(flow.parameters(), lr=5e-3)for idx in range(3001):    opt.zero_grad()    # Minimize KL(p || q)    y = target_dist.sample((1000,))    loss = -flow.log_prob(y).mean()    if idx % 500 == 0:        print('epoch', idx, 'loss', loss)    loss.backward()    opt.step()

Note how we obtain the learnable parameters of the normalizing flow from the flow object, which is a torch.nn.Module. Visualizing samples after learning, we see that we have been successful in matching the target distribution: Congratulations on training your first flow!

## Discussion​

This simple example illustrates a few important points of FlowTorch's design:

Firstly, Bijector objects are agnostic to their shape. A Bijector object specifies how the shape is changed by the forward and inverse operations, and then calculates the exact shapes when it obtains knowledge of the base distribution, when flow = dist.Flow(base_dist, bijectors) is run. Any neural networks or other parametrized functions, which also require this shape information, are not instantiated until the same moment. In this sense, a Bijector can be thought of as a lazy plan for creating a normalizing flow. The advantage of doing things this way is that the shape information can be "type checked" and does not need to be specified in multiple locations (ensuring these quantities are consistent).

Secondly, all objects are designed to have sensible defaults. We do not need to define the conditioning network for bijectors.AffineAutoregressive, it will use a MADE network with sensible hyperparameters and defer initialization until it later receives shape information. Thirdly, there is compatibility, in as far as is possible, with standard PyTorch interfaces such as torch.distributions.