Neural Autoregressive Flows

Chin-Wei Huang 1 2 * David Krueger 1 2 * Alexandre Lacoste 2 Aaron Courville 1 3

Abstract

Normalizing flows and autoregressive models have been successfully combined to produce state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF) (Papamakarios et al., 2017), and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time (Oord et al., 2017), via Inverse Autoregressive Flows (IAF) (Kingma et al., 2016). We unify and generalize these approaches, replacing the (conditionally) affine univariate transformations of MAF/IAF with a more general class of invertible univariate transformations expressed as monotonic neural networks. We demonstrate that the proposed neural autoregressive flows (NAF) are universal approximators for continuous probability distributions, and their greater expressivity allows them to better capture multimodal target distributions. Experimentally, NAF yields state-of-the-art performance on a suite of density estimation tasks and outperforms IAF in variational autoencoders trained on binarized MNIST.1

1. Introduction

Invertible transformations with a tractable Jacobian, also known as normalizing flows, are useful tools in many machine learning problems, for example: (1) In the context of deep generative models, training necessitates evaluating data samples under the model's inverse transformation (Dinh et al., 2016). Tractable density is an appealing property for these models, since it allows the objective of interest to be directly optimized, whereas other mainstream methods rely on alternative losses, in the case of intractable density models (Kingma & Welling, 2013; Rezende et al., 2014), or implicit losses, in the case of adversarial models (Goodfellow et al., 2014). (2) In the context of variational inference (Rezende & Mohamed, 2015), they can be used to improve the variational approximation to the posterior by parameterizing more complex distributions. This is important since a poor variational approximation to the posterior can fail to reflect the right amount of uncertainty, and/or be biased (Turner & Sahani, 2011), resulting in inaccurate and unreliable predictions. We are thus interested in improving techniques for normalizing flows.

Recent work by Kingma et al. (2016) reinterprets autoregressive models as invertible transformations suitable for constructing normalizing flows. The inverse transformation process, unlike sampling from the autoregressive model, is not sequential and thus can be accelerated via parallel computation. This allows multiple layers of transformations to be stacked, increasing expressiveness for better variational inference (Kingma et al., 2016) or better density estimation for generative models (Papamakarios et al., 2017). Stacking also makes it possible to improve on the sequential conditional factorization assumed by autoregressive models such as PixelRNN or PixelCNN (Oord et al., 2016), and thus define a more flexible joint probability.

We note that the normalizing flow introduced by Kingma et al. (2016) only applies an affine transformation to each scalar random variable. Although this transformation is conditioned on preceding variables, the resulting flow can still be susceptible to bad local minima, and thus fail to capture the multimodal shape of a target density; see Figures 1 and 2.

1.1. Contributions of this work

We propose replacing the conditional affine transformation of Kingma et al. (2016) with a richer family of transformations, and note the requirements for doing so. We determine that very general transformations, for instance parametrized by deep neural networks, are possible. We then propose and evaluate several specific monotonic neural network architectures which are better suited for learning multimodal distributions. Concretely, our method amounts to using an autoregressive model to output the weights of multiple independent transformer networks, each of which operates on a single random variable, replacing the affine transformations of previous works.

arXiv:1804.00779v1 [[Link]] 3 Apr 2018

* Equal contribution. 1 MILA, University of Montreal, 2 Element AI, 3 CIFAR fellow. Correspondence to: Chin-Wei Huang <chin-[Link]@[Link]>.

Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

1 Implementation can be found at [Link]Huang/NAF/

Figure 1. Energy function fitting using IAF. Left: true distribution. Center: IAF-affine. Right: IAF-DSF.

Figure 2. Density estimation using MAF. Left: true distribution. Center: MAF-affine. Right: MAF-DSF.

Empirically, we show that our method works better than the state-of-the-art affine autoregressive flows of Kingma et al. (2016) and Papamakarios et al. (2017), both as a sample generator which captures multimodal target densities with higher fidelity, and as a density model which more accurately evaluates the likelihood of data samples drawn from an unknown distribution.

We also demonstrate that our method is a universal approximator on proper distributions in real space, which guarantees the expressiveness of the chosen parameterization and supports our empirical findings.

2. Background

A (finite) normalizing flow (NF), or flow, is an invertible function f_θ : X → Y used to express a transformation between random variables 2. Since f is invertible, the change of variables formula can be used to translate between densities p_Y(y) and p_X(x):

    p_Y(y) = p_X(x) |∂f(x)/∂x|^{-1}    (1)

The determinant of f's Jacobian appears on the right hand side to account for the way in which f can (locally) expand or contract regions of X, thereby lowering or raising the resulting density in those regions' images in Y. Since the composition of invertible functions is itself invertible, complex NFs are often formed via function composition (or "stacking") of simpler NFs.

Normalizing flows are most commonly trained to produce an output distribution p_Y(y) which matches a target distribution (or, more generally, energy function) p_target(y), as measured by the KL-divergence KL(p_Y(y) || p_target(y)). When X or Y follows some simple distribution, such as uniform or standard normal, we call it unstructured noise; we call it structured noise when the distribution is complex and correlated. Two common settings are maximum likelihood and variational inference. Note that these two settings are typically viewed as optimizing different directions of the KL-divergence, whereas we provide a unified view in terms of different input and target distributions. A detailed derivation is presented in the appendix (see Section A).

For maximum likelihood applications (Dinh et al., 2016; Papamakarios et al., 2017), p_target(y) is typically a simple prior over the latent variable y, and f attempts to disentangle the complex empirical distribution of the data, p_X(x), into a simple latent representation p_Y(y) matching the prior (structured to unstructured) 3.

In a typical application of variational inference (Rezende & Mohamed, 2015; Kingma et al., 2016), p_target(y) is a complex posterior over latent variables y, and f transforms a simple input distribution (for instance a standard normal distribution) over x into a complex approximate posterior p_Y(y) (unstructured to structured). In either case, since p_X does not depend on θ, the gradients of the KL-divergence are typically estimated by Monte Carlo:

    ∇_θ D_KL( p_Y(y) || p_target(y) )
        = ∇_θ ∫_Y p_Y(y) log [ p_Y(y) / p_target(y) ] dy
        = ∫_X p_X(x) ∇_θ log [ p_Y(y) / p_target(y) ] dx    (2)

Applying the change of variables formula from Equation 1 to the right hand side of Equation 2 yields:

    E_{x∼p_X(x)} [ ∇_θ log ( p_X(x) |∂f_θ(x)/∂x|^{-1} ) − ∇_θ log p_target(y) |_{y=f_θ(x)} ]    (3)

2 We use x and y to denote inputs and outputs of a function, not the inputs and targets of a supervised learning problem.

3 It may also be possible to form a generative model from such a flow, by passing samples from the prior p_target(y) through f^{-1}, although the cost of doing so may vary. For example, RealNVP (Dinh et al., 2016) was devised as a generative model, and its inverse computation is as cheap as its forward computation, whereas MAF (Papamakarios et al., 2017) is designed for density estimation and is much more expensive to sample from. For the NAF architectures we employ, we do not have an analytic expression for f^{-1}, but it is possible to approximate it numerically.
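As a concrete illustration of Equation 1, the following minimal NumPy sketch evaluates the log-density of the output of a single affine flow applied to a standard normal input. The flow, its parameter values, and the function names are illustrative choices for this note, not the implementation used in the paper.

```python
import numpy as np

def log_prob_y(y, mu=1.5, log_sigma=0.3):
    """log p_Y(y) for y = f(x) = mu + exp(log_sigma) * x with x ~ N(0, 1),
    via the change-of-variables formula (Equation 1):
    log p_Y(y) = log p_X(f^{-1}(y)) - log |df/dx|."""
    sigma = np.exp(log_sigma)
    x = (y - mu) / sigma                              # f^{-1}(y)
    log_px = -0.5 * (x ** 2 + np.log(2 * np.pi))      # standard normal log-density
    log_det = log_sigma                               # log |df/dx| is constant here
    return log_px - log_det

# Sanity check against the closed-form N(mu, sigma^2) density.
y = np.linspace(-3, 6, 7)
mu, sigma = 1.5, np.exp(0.3)
closed_form = -0.5 * (((y - mu) / sigma) ** 2 + np.log(2 * np.pi)) - np.log(sigma)
assert np.allclose(log_prob_y(y), closed_form)
```

For a stacked flow, the same formula applies with the log-determinant terms of the individual transformations summed.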

Figure 3. Difference between autoregressive and inverse autoregressive transformations (left), and between IAF and MAF (right). Upper left: sample generation of an autoregressive model. Unstructured noise is transformed into structured noise. Lower left: inverse autoregressive transformation of structured data. Structured variables are transformed into unstructured variables. Upper right: IAF-style sampling. Lower right: MAF-style evaluation of structured data. ε represents unstructured noise and s represents structured noise.

Thus for efficient training, the following operations must be tractable and cheap:

1. Sampling x ∼ p_X(x)
2. Computing y = f(x)
3. Computing the gradient of the log-likelihood of y = f(x); x ∼ p_X(x) under both p_Y(y) and p_target(y)
4. Computing the gradient of the log-determinant of the Jacobian of f

Research on constructing NFs, such as our work, focuses on finding ways to parametrize flows which meet the above requirements while being maximally flexible in terms of the transformations which they can represent. Note that some of the terms of Equation 3 may be constant with respect to θ 4 and thus trivial to differentiate, such as p_X(x) in the maximum likelihood setting.

Affine autoregressive flows (AAFs) 5, such as inverse autoregressive flows (IAF) (Kingma et al., 2016), are one particularly successful pre-existing approach. Affine autoregressive flows yield a triangular Jacobian matrix, so that the log-determinant can be computed in linear time, as the sum of the diagonal entries on the log scale. In AAFs, the components of x and y are given an order (which may be chosen arbitrarily), and y_t is computed as a function of x_{1:t}. Specifically, this function can be decomposed via an autoregressive conditioner, c, and an invertible transformer, τ, as 6:

    y_t = f(x_{1:t}) := τ(c(x_{1:t−1}), x_t)    (4)

It is possible to efficiently compute the output of c for all t in a single forward pass using a model such as MADE (Germain et al., 2015), as pointed out by Kingma et al. (2016).

In previous work, τ is taken to be an affine transformation with parameters µ ∈ R, σ > 0 output from c. For instance, Dinh et al. (2016) use:

    τ(µ, σ, x_t) = µ + σ x_t    (5)

with σ produced by an exponential nonlinearity. Kingma et al. (2016) use:

    τ(µ, σ, x_t) = σ x_t + (1 − σ) µ    (6)

with σ produced by a sigmoid nonlinearity. Such transformers are trivially invertible, but their relative simplicity also means that the expressivity of f comes entirely from the complexity of c and from stacking multiple AAFs (potentially using different orderings of the variables) 7. However, the only requirements on τ are:

1. The transformer τ must be invertible as a function of x_t.
2. dy_t/dx_t must be cheap to compute.

This raises the possibility of using a more powerful transformer in order to increase the expressivity of the flow.

4 There might be some other parameters other than θ that are learnable, such as parameters of p_X and p_target in the variational inference and maximum likelihood settings, respectively.

5 Our terminology differs from previous works, and hence holds the potential for confusion, but we believe it is apt. Under our unifying perspective, NAF, IAF, AF, and MAF all make use of the same principle, which is an invertible transformer conditioned on the outputs of an autoregressive (and emphatically not an inverse autoregressive) conditioner.

6 Dinh et al. (2014) use m and g^{-1} to denote c and τ, and refer to them as the "coupling function" and "coupling law", respectively.

7 Permuting the order of variables is itself a normalizing flow that does not expand or contract space and can be inverted by another permutation.
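The following NumPy sketch illustrates one affine autoregressive flow layer in the sense of Equations 4 and 5. The conditioner here is a toy linear map standing in for MADE, and all names (affine_ar_flow, toy_conditioner, W_mu, W_s) are our own illustrative choices; because each output depends only on earlier inputs, the Jacobian is triangular and the log-determinant reduces to a sum of log sigma terms.

```python
import numpy as np

def affine_ar_flow(x, conditioner):
    """One affine autoregressive flow layer (Equations 4-5).
    `conditioner(x, t)` returns (mu_t, log_sigma_t) from x[..., :t] only,
    so dy/dx is triangular and log|det| = sum_t log sigma_t."""
    y = np.empty_like(x)
    log_det = np.zeros(x.shape[:-1])
    for t in range(x.shape[-1]):
        mu, log_sigma = conditioner(x, t)
        y[..., t] = mu + np.exp(log_sigma) * x[..., t]   # tau(mu, sigma, x_t)
        log_det += log_sigma
    return y, log_det

# Toy conditioner standing in for MADE: fixed random linear maps of x_{1:t-1}.
rng = np.random.default_rng(0)
D = 3
W_mu, W_s = rng.normal(size=(D, D)), rng.normal(size=(D, D)) * 0.1

def toy_conditioner(x, t):
    prev = x[..., :t]
    mu = prev @ W_mu[:t, t] if t > 0 else np.zeros(x.shape[:-1])
    log_sigma = prev @ W_s[:t, t] if t > 0 else np.zeros(x.shape[:-1])
    return mu, log_sigma

x = rng.normal(size=(5, D))
y, log_det = affine_ar_flow(x, toy_conditioner)
```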
3. Neural Autoregressive Flows

We propose replacing the affine transformer used in previous works with a neural network, yielding a richer family of distributions with only a minor increase in computation and memory requirements. Specifically,

    τ(c(x_{1:t−1}), x_t) = DNN(x_t; φ = c(x_{1:t−1}))    (7)

is a deep neural network which takes the scalar x_t as input and produces y_t as output, and its weights and biases are given by the outputs of c(x_{1:t−1}) 8 (see Figure 4(a)). We refer to these values φ as pseudo-parameters, in order to distinguish them from the statistical parameters of the model.

We now state the condition for NAF to be strictly monotonic, and thus invertible (as per requirement 1):

Proposition 1. Using strictly positive weights and strictly monotonic activation functions for τ_c is sufficient for the entire network to be strictly monotonic.

Meanwhile, dy_t/dx_t and gradients wrt the pseudo-parameters 9 can all be computed efficiently via backpropagation (as per requirement 2).

Whereas affine transformers require information about multimodality in y_t to flow through x_{1:t−1}, our neural autoregressive flows (NAFs) are able to induce multimodality more naturally, via inflection points in τ_c, as shown in Figure 5. Intuitively, τ_c can be viewed as analogous to a cumulative distribution function (CDF), so that its derivative corresponds to a PDF, where its inflection points yield local maxima or minima.

Figure 4. Top: In neural autoregressive flows, the transformation of the current input variable is performed by an MLP whose parameters are output from an autoregressive conditioner model, c_t := c(x_{1:t−1}), which incorporates information from previous input variables. Bottom: The architectures we use in this work: deep sigmoidal flows (DSF) and deep dense sigmoidal flows (DDSF). See Section 3.1 for details. (Panels: (a) neural autoregressive flows (NAF); (b) DSF; (c) DDSF.)

Figure 5. Illustration of the effects of traditional IAF (top), and our proposed NAF (bottom). Areas where the slope of the transformer τ_c is greater/less than 1 are compressed/expanded (respectively) in the output distribution. Inflection points in τ_c(x_t) (middle) can transform a unimodal p(x_t) (left) into a multimodal p(y_t) (right); NAF allows for such inflection points, whereas IAF does not.

3.1. Transformer Architectures

In this work, we use two specific architectures for τ_c, which we refer to as deep sigmoidal flows (DSF) and deep dense sigmoidal flows (DDSF) (see Figures 4(b) and 4(c) for an illustration). We find that small neural network transformers of 1 or 2 hidden layers with 8 or 16 sigmoid units perform well across our experiments, although there are other possibilities worth exploring (see Section 3.3). Sigmoids contain inflection points, and so can easily induce inflection points in τ_c, and thus multimodality in p(y_t). We begin by describing the DSF transformation, which is already sufficiently expressive to form a universal approximator for probability distributions, as we prove in Section 4.

The DSF transformation resembles an MLP with a single hidden layer of sigmoid units. Naive use of sigmoid activation functions would restrict the range of τ_c, however, and result in a model that assigns 0 density to sufficiently large or small y_t, which is problematic when y_t can take on arbitrary real values. We address this issue by applying the inverse sigmoid (or "logit") function at the output layer.

8 We'll sometimes write τ_c for τ(c(x_{1:t−1}), ·).

9 Gradients for pseudo-parameters are backpropagated through the conditioner, c, in order to train its parameters.
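A minimal sketch of the idea behind Proposition 1: constrain the transformer's weights to be strictly positive (here via softplus of unconstrained pseudo-parameters) and use strictly increasing activations, and the resulting scalar map is strictly monotonic. The architecture and names below (monotonic_transformer, pre_w, pre_a, b) are our own illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def monotonic_transformer(x, pre_w, pre_a, b):
    """A toy tau_c in the spirit of Proposition 1: weights are made strictly
    positive with softplus, activations are strictly increasing, so the
    scalar-to-scalar map is strictly monotonic in x."""
    w = np.log1p(np.exp(pre_w))          # positive output weights
    a = np.log1p(np.exp(pre_a))          # positive input weights
    h = sigmoid(a * x[..., None] + b)    # hidden layer, shape (..., d)
    return (h * w).sum(-1)               # positive-weighted sum -> monotonic

rng = np.random.default_rng(1)
d = 8
pre_w, pre_a, b = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

xs = np.linspace(-5, 5, 1001)
ys = monotonic_transformer(xs, pre_w, pre_a, b)
assert np.all(np.diff(ys) > 0)   # numerically strictly increasing
```

In a full NAF, the arrays pre_w, pre_a and b would be the pseudo-parameters output by the conditioner c for each dimension t.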

To ensure that the output's preactivation is in the domain of the logit (that is, (0, 1)), we combine the outputs of the sigmoid units via an attention-like (Bahdanau et al., 2014) softmax-weighted sum:

    y_t = σ^{-1}( w^T · σ( a · x_t + b ) )    (8)

where w^T is 1 × d, a and b are d × 1, 0 < w_{i,j} < 1, Σ_i w_{i,j} = 1, a_{s,t} > 0, and d denotes the number of hidden units 10.

Since all of the sigmoid activations are bounded between 0 and 1, the final preactivation (which is their convex combination) is as well. The complete DSF transformation can be seen as mapping the original random variable to a different space through an activation function, where doing affine/linear operations is non-linear with respect to the variable in the original space, and then mapping it back to the original space through the inverse activation.

When stacking multiple sigmoidal transformations, we notice that the result resembles an MLP with a bottleneck, as shown at the bottom left of Figure 4. A more general alternative is the deep dense sigmoidal flow (DDSF), which takes the form of a fully connected MLP:

    h^{(l+1)} = σ^{-1}( w^{(l+1)} · σ( a^{(l+1)} ⊙ ( u^{(l+1)} · h^{(l)} ) + b^{(l+1)} ) )    (9)

for 1 ≤ l ≤ L, where h^{(0)} = x and y = h^{(L)}; d_0 = d_L = 1; w^{(l+1)} is d_{l+1} × d_{l+1}, u^{(l+1)} is d_{l+1} × d_l, and a^{(l+1)}, b^{(l+1)} are d_{l+1}-dimensional. We also require Σ_j w_{ij} = 1, Σ_j u_{kj} = 1 for all i, k, and all parameters except b to be positive.

We use either DSF (Equation 8) or DDSF (Equation 9) to define the transformer function τ in Equation 4. To compute the log-determinant of the Jacobian in a numerically stable way, we need to apply log-sum-exp to the chain rule

    ∇_x y = [ ∇_{h^{(L−1)}} h^{(L)} ] [ ∇_{h^{(L−2)}} h^{(L−1)} ] · · · [ ∇_{h^{(0)}} h^{(1)} ]    (10)

We elaborate more on the numerical stability in parameterization and computation of logarithmic operations in the supplementary materials.

3.2. Efficient Parametrization of Larger Transformers

Multi-layer NAFs, such as DDSF, require c to output O(d^2) pseudo-parameters, where d is the number of hidden units in each layer of τ. As this is impractical for large d, we propose parametrizing τ with O(d^2) statistical parameters, but only O(d) pseudo-parameters which modulate the computation on a per-unit basis, using a technique such as conditional batch-normalization (CBN) (Dumoulin et al., 2016). Such an approach also makes it possible to use minibatch-style matrix-matrix products for the forward and backwards propagation through the graph of τ_c. In particular, we use a technique similar to conditional weight normalization (CWN) (Krueger et al., 2017) in our experiments with DDSF; see appendix for details.

3.3. Possibilities for Alternative Architectures

While the DSF and DDSF architectures performed well in our experiments, there are many alternatives to be explored. One possibility is using other (strictly) monotonic activation functions in τ_c, such as leaky ReLUs (LReLU) (Xu et al., 2015) or ELUs (Clevert et al., 2015). Leaky ReLUs in particular are bijections on R and so would not require the softmax-weighted summation and activation function inversion tricks discussed in the previous section.

Finally, we emphasize that in general, τ need not be expressed as a neural architecture; it only needs to satisfy the requirements of invertibility and differentiability given at the end of Section 2.

4. NAFs are Universal Density Approximators

In this section, we prove that NAFs (specifically DSF) can be used to approximate any probability distribution over real vectors arbitrarily well, given that τ_c has enough hidden units output by generic neural networks with autoregressive conditioning. Ours is the first such result we are aware of for finite normalizing flows.

Our result builds on the work of Huang et al. (2017), who demonstrate the general universal representational capability of inverse autoregressive transformations parameterized by an autoregressive neural network (that transform uniform random variables into any random variables in reals). However, we note that their proposition is weaker than we require, as there are no constraints on the parameterization of the transformer τ, whereas we've constrained τ to have strictly positive weights and monotonic activation functions, to ensure it is invertible throughout training.

The idea of proving the universal approximation theorem for DSF (1) in the IAF direction (which transforms unstructured random variables into structured random variables) resembles the concept of inverse transform sampling: we first draw a sample from a simple distribution, such as the uniform distribution, and then pass the sample through DSF. If DSF converges to any inverse conditional CDF, the resulting random variable then converges in distribution to any target random variable, as long as the latter has positive continuous probability density everywhere in the reals. (2) For the MAF direction, DSF serves as a solution to the non-linear independent component analysis problem (Hyvärinen & Pajunen, 1999), which disentangles structured random variables into uniformly and independently distributed random variables.

10 Constraints on the variables are enforced via activation functions; w and a are outputs of a softmax, and softplus or exp, respectively.
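A minimal NumPy sketch of the DSF transformer of Equation 8, with the constraints of footnote 10 (softmax for w, softplus for a). In a trained model the pseudo-parameters would be produced by the conditioner c(x_{1:t−1}); here they are free arrays, and the function names are our own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p) - np.log1p(-p)

def dsf_transformer(x_t, pre_w, pre_a, b):
    """Deep sigmoidal flow transformer for one scalar x_t (Equation 8)."""
    w = np.exp(pre_w - pre_w.max())
    w = w / w.sum()                          # softmax: w > 0, sums to 1
    a = np.log1p(np.exp(pre_a))              # softplus: a > 0
    pre = (w * sigmoid(a * x_t + b)).sum()   # convex combination, stays in (0, 1)
    return logit(pre)                        # map back to the real line

# dy_t/dx_t by finite differences: strictly positive, as required of tau.
rng = np.random.default_rng(0)
d = 8
pre_w, pre_a, b = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
eps = 1e-5
for x_t in (-2.0, 0.0, 3.0):
    slope = (dsf_transformer(x_t + eps, pre_w, pre_a, b)
             - dsf_transformer(x_t, pre_w, pre_a, b)) / eps
    assert slope > 0
```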

(3) Combining the two, we further show that DSF can transform any structured noise variable into a random variable with any desired distribution.

We define the following notation for the pre-logit of the DSF transformation (compare Equation 8):

    S(x_t, C(x_{1:t−1})) = Σ_{j=1}^{n} w_j(x_{1:t−1}) · σ( (x_t − b_j(x_{1:t−1})) / τ_j(x_{1:t−1}) )    (11)

where C = (w_j, b_j, τ_j)_{j=1}^{n} are functions of x_{1:t−1} parameterized by neural networks. Let b_j be in (r_0, r_1); τ_j be bounded and positive; Σ_{j=1}^{n} w_j = 1 and w_j > 0. See Appendix F and G for the proofs.

Proposition 2. (DSF universally transforms uniform random variables into any desired random variables) Let Y be a random vector in R^m and assume Y has a strictly positive and continuous probability density distribution. Let X ∼ Unif((0, 1)^m). Then there exists a sequence of functions (G_n)_{n≥1} parameterized by autoregressive neural networks in the following form

    G(x)_t = σ^{-1}( S(x_t; C^t(x_{1:t−1})) )    (12)

where C^t = (a_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} are functions of x_{1:t−1}, such that Y_n := G_n(X) converges in distribution to Y.

Proposition 3. (DSF universally transforms any random variables into uniformly distributed random variables) Let X be a random vector in an open set U ⊂ R^m. Assume X has a positive and continuous probability density distribution. Let Y ∼ Unif((0, 1)^m). Then there exists a sequence of functions (H_n)_{n≥1} parameterized by autoregressive neural networks in the following form

    H(x)_t = S(x_t; C^t(x_{1:t−1}))    (13)

where C^t = (a_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} are functions of x_{1:t−1}, such that Y_n := H_n(X) converges in distribution to Y.

Theorem 1. (DSF universally transforms any random variables into any desired random variables) Let X be a random vector in an open set U ⊂ R^m. Let Y be a random vector in R^m. Assume both X and Y have a positive and continuous probability density distribution. Then there exists a sequence of functions (K_n)_{n≥1} parameterized by autoregressive neural networks in the following form

    K(x)_t = σ^{-1}( S(x_t; C^t(x_{1:t−1})) )    (14)

where C^t = (a_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} are functions of x_{1:t−1}, such that Y_n := K_n(X) converges in distribution to Y.

5. Related work

Neural autoregressive flows are a generalization of the affine autoregressive flows introduced by Kingma et al. (2016) as inverse autoregressive flows (IAF) and further developed by Chen et al. (2016) and Papamakarios et al. (2017) as autoregressive flows (AF) and masked autoregressive flows (MAF), respectively; for details on their relationship to our work see Sections 2 and 3. While Dinh et al. (2014) draw a particular connection between their NICE model and the Neural Autoregressive Density Estimator (NADE) (Larochelle & Murray, 2011), Kingma et al. (2016) were the first to highlight the general approach of using autoregressive models to construct normalizing flows. Chen et al. (2016) and then Papamakarios et al. (2017) subsequently noticed that this same approach could be used efficiently in reverse when the key operation is evaluating, as opposed to sampling from, the flow's learned output density. Our method increases the expressivity of these previous approaches by using a neural net to output pseudo-parameters of another network, thus falling into the hypernetwork framework (Ha et al., 2016; Bertinetto et al., 2016; Brabandere et al., 2016).

There has been a growing interest in normalizing flows (NFs) in the deep learning community, driven by successful applications and structural advantages they have over alternatives. Rippel & Adams (2013), Rezende & Mohamed (2015) and Dinh et al. (2014) first introduced normalizing flows to the deep learning community as density models, variational posteriors and generative models, respectively. In contrast to traditional variational posteriors, NFs can represent a richer family of distributions without requiring approximations (beyond Monte Carlo estimation of the KL-divergence). The NF-based RealNVP-style generative models (Dinh et al., 2016; 2014) also have qualitative advantages over alternative approaches. Unlike generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), computing likelihood is cheap. Unlike autoregressive generative models, such as PixelCNNs (Oord et al., 2016), sampling is also cheap. Unfortunately, in practice RealNVP-style models are not currently competitive with autoregressive models in terms of likelihood, perhaps due to the more restricted nature of the transformations they employ.

Several promising recent works expand the capabilities of NFs for generative modeling and density estimation, however. Perhaps the most exciting example is Oord et al. (2017), who propose the probability density distillation technique to train an IAF (Kingma et al., 2016) based on the autoregressive WaveNet (van den Oord et al., 2016) as a generative model, using another pretrained WaveNet model to express the target density, thus overcoming the slow sequential sampling procedure required by the original WaveNet (and characteristic of autoregressive models in general), and reaching super-real-time speeds suitable for production.
Table 1. Using DSF to improve variational inference. We report the numbers for the affine IAF with our implementation. We note that the log likelihood reported by Kingma et al. (2016) is 78.88. The average and standard deviation are carried out with 5 trials of experiments with different random seeds.

Model        ELBO            log p(x)
VAE          85.00 ± 0.03    81.66 ± 0.05
IAF-affine   82.25 ± 0.05    80.05 ± 0.04
IAF-DSF      81.92 ± 0.04    79.86 ± 0.01

Figure 7. Learning curve of MAF-style and IAF-style training (left: density estimation with MAF, KL(p||q); right: fitting energy with IAF, KL(q||p); both against log #iterations, for the affine, DSF, and DDSF variants). q denotes our trained model, and p denotes the target.

Figure 8. The DSF model effectively captures the true posterior distribution over the frequency of a sine wave. Left: The three observations (marked with red x's) are compatible with sine waves of frequency f ∈ {0.0, 0.6, 1.2, 1.8}. Right: a histogram of samples from the DSF approximate posterior ("counts") and a kernel density estimate of the distribution it represents (KDE).

Figure 6. Fitting a grid of Gaussian distributions using maximum likelihood. Left: true distribution. Center: affine autoregressive flow (AAF). Right: neural autoregressive flow (NAF).

The previously mentioned MAF technique (Papamakarios et al., 2017) further demonstrates the potential of NFs to improve on state-of-the-art autoregressive density estimation models; such highly performant MAF models could also be "distilled" for rapid sampling using the same procedure as in Oord et al. (2017).

Other recent works also find novel applications of NFs, demonstrating their broad utility. Loaiza-Ganem et al. (2017) use NFs to solve maximum entropy problems, rather than match a target distribution. Louizos & Welling (2017) and Krueger et al. (2017) apply NFs to express approximate posteriors over parameters of neural networks. Song et al. (2017) use NFs as a proposal distribution in a novel Metropolis-Hastings MCMC algorithm.

Finally, there are also several works which develop new techniques for constructing NFs that are orthogonal to ours (Tomczak & Welling, 2017; 2016; Gemici et al., 2016; Duvenaud et al., 2016; Berg et al., 2018).

6. Experiments

Our experiments evaluate NAFs on the classic applications of variational inference and density estimation, where we outperform IAF and MAF baselines. We first demonstrate the qualitative advantage NAFs have over AAFs in energy function fitting and density estimation (Section 6.1). We then demonstrate the capability of NAFs to capture a multimodal Bayesian posterior in a limited data setting (Section 6.2). For larger-scale experiments, we show that using NAF instead of IAF to approximate the posterior distribution of latent variables in a variational autoencoder (Kingma & Welling, 2013; Rezende et al., 2014) yields better likelihood results on binarized MNIST (Larochelle & Murray, 2011) (Section 6.3). Finally, we report our experimental results on density estimation of a suite of UCI datasets (Section 6.4).

6.1. Toy energy fitting and density estimation

6.1.1. EXPRESSIVENESS

First, we demonstrate that, in the case of marginally independent distributions, affine transformations can fail to fit the true distribution. We consider a mixture of Gaussians density estimation task. We define the modes of the Gaussians to be laid out on a 2D meshgrid within the range [-5, 5], and consider 2, 5 and 10 modes on each dimension. While the affine flow only produces a single mode, the neural flow matches the target distribution quite well even up to a 10x10 grid with 100 modes (see Figure 6).
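For reference, the toy target of Section 6.1.1 can be written down in a few lines. The sketch below builds the grid-of-Gaussians log-density; the component bandwidth sigma and the equal mixture weights are our own choices for illustration, since they are not specified in the text above.

```python
import numpy as np

def grid_mog_logpdf(z, n_modes=5, lo=-5.0, hi=5.0, sigma=0.5):
    """Log-density of an equally weighted mixture of isotropic Gaussians whose
    means lie on an n_modes x n_modes meshgrid in [lo, hi]^2."""
    grid = np.linspace(lo, hi, n_modes)
    means = np.stack(np.meshgrid(grid, grid), -1).reshape(-1, 2)    # (K, 2)
    diff = z[:, None, :] - means[None, :, :]                        # (N, K, 2)
    log_comp = -0.5 * (diff ** 2).sum(-1) / sigma**2 \
               - np.log(2 * np.pi * sigma**2)                       # (N, K)
    # log of the equally weighted mixture, computed stably with logsumexp.
    m = log_comp.max(axis=1, keepdims=True)
    return (m.squeeze(1) + np.log(np.exp(log_comp - m).sum(1))) - np.log(len(means))

z = np.random.default_rng(0).uniform(-6, 6, size=(4, 2))
print(grid_mog_logpdf(z))
```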

Table 2. Test log-likelihood and error bars of 2 standard deviations on the 5 datasets (5 trials of experiments). Neural autoregressive flows (NAFs) produce state-of-the-art density estimation results on all 5 datasets. The numbers (5 or 10) in parentheses indicate the number of transformations which were stacked; for TAN (Oliva et al., 2018), we include their best results, achieved using different architectures on different datasets. We also include validation results to give future researchers a fair way of comparing their methods with ours during development.

Model                        POWER          GAS             HEPMASS          MINIBOONE       BSDS300
MADE MoG                     0.40 ± 0.01    8.47 ± 0.02     −15.15 ± 0.02    −12.27 ± 0.47   153.71 ± 0.28
MAF-affine (5)               0.14 ± 0.01    9.07 ± 0.02     −17.70 ± 0.02    −11.75 ± 0.44   155.69 ± 0.28
MAF-affine (10)              0.24 ± 0.01    10.08 ± 0.02    −17.73 ± 0.02    −12.24 ± 0.45   154.93 ± 0.28
MAF-affine MoG (5)           0.30 ± 0.01    9.59 ± 0.02     −17.39 ± 0.02    −11.68 ± 0.44   156.36 ± 0.28
TAN (various architectures)  0.48 ± 0.01    11.19 ± 0.02    −15.12 ± 0.02    −11.01 ± 0.48   157.03 ± 0.07
MAF-DDSF (5)                 0.62 ± 0.01    11.91 ± 0.13    −15.09 ± 0.40    −8.86 ± 0.15    157.73 ± 0.04
MAF-DDSF (10)                0.60 ± 0.02    11.96 ± 0.33    −15.32 ± 0.23    −9.01 ± 0.01    157.43 ± 0.30
MAF-DDSF (5) valid           0.63 ± 0.01    11.91 ± 0.13    −15.10 ± 0.42    −8.38 ± 0.13    172.89 ± 0.04
MAF-DDSF (10) valid          0.60 ± 0.02    11.95 ± 0.33    −15.34 ± 0.24    −8.50 ± 0.03    172.58 ± 0.32

6.1.2. CONVERGENCE

We then repeat the experiment that produces Figures 1 and 2 sixteen times, smooth out the learning curves, and present the average convergence result of each model with its corresponding standard deviation. For the affine flow, we stack 6 layers of transformation with reversed ordering. For DSF and DDSF we used one transformation. We set d = 16 for both, and L = 2 for DDSF.

6.2. Sine Wave experiment

Here we demonstrate the ability of DSF to capture multimodal posterior distributions. To do so, we create a toy experiment where the goal is to infer the posterior over the frequency of a sine wave, given only 3 datapoints. We fix the form of the function as y(t) = sin(2πf · t) and specify a uniform prior over the frequency: p(f) = U([0, 2]). The task is to infer the posterior distribution p(f | T, Y) given the dataset (T, Y) = ((0, 5/6, 10/6), (0, 0, 0)), as represented by the red crosses of Figure 8 (left). We assume the data likelihood given the frequency parameter to be p(y_i | t_i, f) = N(y_i; y_f(t_i), 0.125), where the variance σ^2 = 0.125 represents the inherent uncertainty of the data. Figure 8 (right) shows that DSF learns a good posterior in this task.

6.3. Amortized Approximate Posterior

We evaluate NAF's ability to improve variational inference, in the context of the binarized MNIST (Larochelle & Murray, 2011) benchmark, using the well-known variational autoencoder (Kingma & Welling, 2013; Rezende et al., 2014) (Table 1). Here again the DSF architecture outperforms both standard IAF and the traditional independent Gaussian posterior by a statistically significant margin.

6.4. Density Estimation with Masked Autoregressive Flows

We replicate the density estimation experiments of Papamakarios et al. (2017), which compare MADE (Germain et al., 2015) and RealNVP (Dinh et al., 2016) to their proposed MAF model (using either 5 or 10 layers of MAF) on BSDS300 (Martin et al., 2001) as well as 4 UCI datasets (Lichman, 2013) processed as in Uria et al. (2013). Simply replacing the affine transformer with our DDSF architecture in their best performing architecture for each task (keeping all other settings fixed) results in substantial performance gains, and also outperforms the more recent Transformation Autoregressive Networks (TAN) of Oliva et al. (2018), setting a new state-of-the-art for these tasks. Results are presented in Table 2.

7. Conclusion

In this work we introduce the neural autoregressive flow (NAF), a flexible method for tractably approximating rich families of distributions. In particular, our experiments show that NAF is able to model multimodal distributions and outperform related methods such as inverse autoregressive flow in density estimation and variational inference. Our work emphasizes the difficulty and importance of capturing multimodality, as previous methods fail even on simple toy tasks, whereas our method yields significant improvements in performance.

Acknowledgements

We would like to thank Tegan Maharaj, Ahmed Touati, Shawn Tan and Giancarlo Kerg for helpful comments and advice. We also thank George Papamakarios for providing details on the density estimation task's setup.

References

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Berg, Rianne van den, Hasenclever, Leonard, Tomczak, Jakub M, and Welling, Max. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

Bertinetto, Luca, Henriques, João F., Valmadre, Jack, Torr, Philip H. S., and Vedaldi, Andrea. Learning feed-forward one-shot learners. CoRR, 2016.

Brabandere, Bert De, Jia, Xu, Tuytelaars, Tinne, and Gool, Luc Van. Dynamic filter networks. CoRR, 2016.

Chen, Xi, Kingma, Diederik P, Salimans, Tim, Duan, Yan, Dhariwal, Prafulla, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

Clevert, Djork-Arné, Unterthiner, Thomas, and Hochreiter, Sepp. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

Cybenko, George. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989.

Dinh, Laurent, Krueger, David, and Bengio, Yoshua. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Dumoulin, Vincent, Shlens, Jonathon, and Kudlur, Manjunath. A learned representation for artistic style. CoRR, 2016.

Duvenaud, David, Maclaurin, Dougal, and Adams, Ryan. Early stopping as nonparametric variational inference. In Artificial Intelligence and Statistics, 2016.

Gemici, Mevlana C, Rezende, Danilo, and Mohamed, Shakir. Normalizing flows on Riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.

Germain, Mathieu, Gregor, Karol, Murray, Iain, and Larochelle, Hugo. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Ha, David, Dai, Andrew, and Le, Quoc V. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Huang, Chin-Wei, Touati, Ahmed, Dinh, Laurent, Drozdzal, Michal, Havaei, Mohammad, Charlin, Laurent, and Courville, Aaron. Learnable explicit density for continuous latent space and variational inference. arXiv preprint arXiv:1710.02248, 2017.

Hyvärinen, Aapo and Pajunen, Petteri. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 1999.

Kingma, Diederik P and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. arXiv e-prints, June 2015.

Kingma, Diederik P, Salimans, Tim, Jozefowicz, Rafal, Chen, Xi, Sutskever, Ilya, and Welling, Max. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 2016.

Krueger, David, Huang, Chin-Wei, Islam, Riashat, Turner, Ryan, Lacoste, Alexandre, and Courville, Aaron. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.

Larochelle, Hugo and Murray, Iain. The neural autoregressive distribution estimator. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR: W&CP, 2011.

Lichman, M. UCI machine learning repository, 2013.

Loaiza-Ganem, Gabriel, Gao, Yuanjun, and Cunningham, John P. Maximum entropy flow networks. arXiv preprint arXiv:1701.03504, 2017.

Louizos, Christos and Welling, Max. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.

Martin, David, Fowlkes, Charless, Tal, Doron, and Malik, Jitendra. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2. IEEE, 2001.

Oliva, J. B., Dubey, A., Póczos, B., Schneider, J., and Xing, E. P. Transformation autoregressive networks. arXiv e-prints, January 2018.

Oord, Aaron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Oord, Aaron van den, Li, Yazhe, Babuschkin, Igor, Simonyan, Karen, Vinyals, Oriol, Kavukcuoglu, Koray, Driessche, George van den, Lockhart, Edward, Cobo, Luis C, Stimberg, Florian, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.

Papamakarios, George, Murray, Iain, and Pavlakou, Theo. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2017.

Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992.

Reddi, Sashank J, Kale, Satyen, and Kumar, Sanjiv. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Rezende, Danilo Jimenez and Mohamed, Shakir. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Rippel, Oren and Adams, Ryan Prescott. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 2016.

Song, Jiaming, Zhao, Shengjia, and Ermon, Stefano. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, 2017.

Tomczak, Jakub M and Welling, Max. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.

Tomczak, Jakub M and Welling, Max. Improving variational auto-encoders using convex combination linear inverse autoregressive flow. arXiv preprint arXiv:1706.02326, 2017.

Turner, Richard E and Sahani, Maneesh. Two problems with variational expectation maximisation for time-series models. Bayesian Time Series Models, 2011.

Uria, Benigno, Murray, Iain, and Larochelle, Hugo. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2013.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv e-prints, September 2016.

Xu, Bing, Wang, Naiyan, Chen, Tianqi, and Li, Mu. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.

A. Exclusive KL View of the MLE

Let's assume a change-of-variable model p_Z(z) on the random variable Z ∈ R^m, such as the one used in Dinh et al. (2016): z_0 ∼ p_0(z_0) and z = ψ(z_0), where ψ is an invertible function and density evaluation of z_0 ∈ R^m is tractable under p_0. The resulting density function can be written

    p_Z(z) = p_0(z_0) |∂ψ(z_0)/∂z_0|^{-1}

The maximum likelihood principle requires us to minimize the following metric:

    D_KL( p_data(z) || p_Z(z) )
        = E_{p_data}[ log p_data(z) − log p_Z(z) ]
        = E_{p_data}[ log p_data(z) − log ( p_0(z_0) |∂ψ^{-1}(z)/∂z| ) ]
        = E_{p_data}[ log ( p_data(z) |∂ψ^{-1}(z)/∂z|^{-1} ) − log p_0(z_0) ]

which coincides with the exclusive KL divergence; to see this, we take X = Z, Y = Z_0, f = ψ^{-1}, p_X = p_data, and p_target = p_0:

        = E_{p_X}[ log ( p_X(x) |∂f(x)/∂x|^{-1} ) − log p_target(y) ]
        = D_KL( p_Y(y) || p_target(y) )

This means we want to transform the empirical distribution p_data, or p_X, to fit the target density (the usually unstructured, base distribution p_0), as explained in Section 2.

B. Monotonicity of NAF

Here we show that using strictly positive weights and strictly monotonically increasing activation functions is sufficient to ensure strict monotonicity of NAF. For brevity, we write monotonic, or monotonicity, to represent the strictly monotonically increasing behavior of a function. Also, note that a continuous function is strictly monotonically increasing exactly when its derivative is greater than 0.

Proof. (Proposition 1)

Suppose we have an MLP with L + 1 layers: h_0, h_1, h_2, ..., h_L, with x = h_0 and y = h_L, where x and y are scalar, and h_{l,j} denotes the j-th node of the l-th layer. For 1 ≤ l ≤ L, we have

    p_{l,j} = w_{l,j}^T h_{l−1} + b_{l,j}    (15)
    h_{l,j} = A_l(p_{l,j})    (16)

for some monotonic activation function A_l, positive weight vector w_{l,j}, and bias b_{l,j}. Differentiating this yields

    dh_{l,j} / dh_{l−1,k} = dA_l(p_{l,j}) / dp_{l,j} · w_{l,j,k}    (17)

which is greater than 0 since A is monotonic (and hence has positive derivative), and w_{l,j,k} is positive by construction. Thus any unit of layer l is monotonic with respect to any unit of layer l − 1, for 1 ≤ l ≤ L.

Now, suppose we have J monotonic functions f_j of the input x. Then the weighted sum of these functions Σ_{j=1}^{J} u_j f_j with u_j > 0 is also monotonic with respect to x, since

    d/dx Σ_{j=1}^{J} u_j f_j = Σ_{j=1}^{J} u_j df_j/dx > 0    (18)

Finally, we use induction to show that all h_{l,j} (including y) are monotonic with respect to x.

1. The base case (l = 1) is given by Equation 17.

2. Suppose the inductive hypothesis holds, which means h_{l,j} is monotonic with respect to x for all j of layer l. Then by Equation 18, h_{l+1,k} is also monotonic with respect to x for all k.

Thus by mathematical induction, monotonicity of h_{l,j} holds for all l and j.

C. Log Determinant of Jacobian

As we mention at the end of Section 3.1, to compute the log-determinant of the Jacobian as part of the objective function, we need to handle numerical stability. We first derive the Jacobian of the DDSF (note that DSF is a special case of DDSF), and then summarize the numerically stable operations that were utilized in this work.

C.1. Jacobian of DDSF

Again defining x = h^{(0)} and y = h^{(L)}, the Jacobian of each DDSF transformation can be written as a sequence of dot products due to the chain rule:

    ∇_x y = [ ∇_{h^{(L−1)}} h^{(L)} ] [ ∇_{h^{(L−2)}} h^{(L−1)} ] · · · [ ∇_{h^{(0)}} h^{(1)} ]    (19)
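The chain-rule factorization of Equation 19 is easy to check numerically in the scalar case, where the log-determinant is a sum of per-stage log-derivatives. The sketch below uses a single DSF-like block split into its two stages (a convex combination of sigmoids followed by a logit); the specific constants and function names are our own illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stage1(x):   # convex combination of sigmoids, strictly increasing
    return 0.6 * sigmoid(2.0 * x + 0.5) + 0.4 * sigmoid(0.5 * x - 1.0)

def dstage1(x):
    s1, s2 = sigmoid(2.0 * x + 0.5), sigmoid(0.5 * x - 1.0)
    return 0.6 * 2.0 * s1 * (1 - s1) + 0.4 * 0.5 * s2 * (1 - s2)

def stage2(h):   # logit, strictly increasing on (0, 1)
    return np.log(h) - np.log1p(-h)

def dstage2(h):
    return 1.0 / (h * (1 - h))

def flow(x):
    return stage2(stage1(x))

def log_abs_det(x):
    """Equation 19 in the scalar case: the derivative of a composition is the
    product of per-stage derivatives, so its log is a sum of per-stage logs."""
    h = stage1(x)
    return np.log(dstage1(x)) + np.log(dstage2(h))

x = 0.3
fd = (flow(x + 1e-6) - flow(x - 1e-6)) / 2e-6   # finite-difference check
assert abs(np.log(fd) - log_abs_det(x)) < 1e-4
```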

For notational convenience, we define a few more intermediate variables. For each layer of DDSF, we have

    C^{(l+1)} = a^{(l+1)} ⊙ ( u^{(l+1)} · h^{(l)} ) + b^{(l+1)}
    D^{(l+1)} = w^{(l+1)} · σ( C^{(l+1)} )
    h^{(l+1)} = σ^{-1}( D^{(l+1)} )

where C^{(l+1)}, D^{(l+1)}, a^{(l+1)}, b^{(l+1)} ∈ R^{d_{l+1}}, u^{(l+1)} ∈ R^{d_{l+1} × d_l}, and w^{(l+1)} ∈ R^{d_{l+1} × d_{l+1}}. The gradient can be expanded as

    ∇_{h^{(l)}} h^{(l+1)} =
          ∇_D σ^{-1}(D^{(l+1)})_{[:,•]}
        ⊙ ∇_{σ(C)} D^{(l+1)}
        ⊙ ∇_C σ(C^{(l+1)})_{[•,:]}
        ⊙ ∇_{(u^{(l+1)} h^{(l)})} C^{(l+1)}_{[•,:]}
        ×_{−1} ∇_{h^{(l)}} ( u^{(l+1)} h^{(l)} )_{[•,:,:]}

      =   [ 1 / ( D^{(l+1)} (1 − D^{(l+1)}) ) ]_{[:,•]}
        ⊙ w^{(l+1)}
        ⊙ [ σ(C^{(l+1)}) (1 − σ(C^{(l+1)})) ]_{[•,:]}
        ⊙ a^{(l+1)}_{[•,:]}
        ×_{−1} u^{(l+1)}_{[•,:,:]}    (20)

where the bullet • in the subscript indicates the dimension is broadcasted, ⊙ denotes element-wise multiplication, and ×_{−1} denotes summation over the last dimension after element-wise product.

C.2. Numerically Stable Operations

Since the Jacobian of each DDSF transformation is a chain of dot products (Equation 19), with some nested multiplicative operations (Equation 20), we calculate everything in the log-scale (where multiplication is replaced with addition) to avoid unstable gradient signals.

C.2.1. LOG ACTIVATION

To ensure the summing-to-one and positivity constraints of u and w, we let the autoregressive conditioner output the pre-activations (denoted u' and w' below), and apply softmax to them. We do the same for a by having the conditioner output a' and apply softplus to ensure positivity. In Equation 20, we have

    log w = logsoftmax(w')
    log u = logsoftmax(u')
    log σ(C) = logsigmoid(C)
    log (1 − σ(C)) = logsigmoid(−C)

where

    logsoftmax(x) = x − logsumexp(x)
    logsigmoid(x) = −softplus(−x)
    logsumexp(x) = log( Σ_i exp(x_i − x*) ) + x*
    softplus(x) = log(1 + exp(x)) + δ

where x* = max_i {x_i} and δ is a small value such as 10^{-6}.

C.2.2. LOGARITHMIC DOT PRODUCT

In both Equations 19 and 20, we encounter matrix/tensor products, which are achieved by summing over one dimension after doing element-wise multiplication. Let M̃_1 = log M_1 and M̃_2 = log M_2 be d_0 × d_1 and d_1 × d_2, respectively. The logarithmic matrix dot product ⋆ can be written as:

    M̃_1 ⋆ M̃_2 = logsumexp_{dim=1}( (M̃_1)_{[:,:,•]} + (M̃_2)_{[•,:,:]} )

where the subscript of logsumexp indicates the dimension (index starting from 0) over which the elements are to be summed up. Note that M̃_1 ⋆ M̃_2 = log(M_1 · M_2).

D. Scalability and Parameter Sharing

As discussed in Section 3.2, a multi-layer NAF such as DDSF requires the autoregressive conditioner c to output many pseudo-parameters, on the order of O(L d^2), where L is the number of layers of the transformer network (τ), and d is the average number of hidden units per layer. In practice, we reduce the number of outputs (and thus the computation and memory requirements) of DDSF by instead endowing τ with some learned (non-conditional) statistical parameters. Specifically, we decompose w and u (the preactivations of τ's weights, see Section C.2.1) into pseudo-parameters and statistical parameters. Take u for example:

    u^{(l+1)} = softmax_{dim=1}( v^{(l+1)} + η^{(l+1)}_{[•,:]} )

where v^{(l+1)} is a d_{l+1} × d_l matrix of statistical parameters, and η is output by c. See figure D for a depiction.
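The log-domain primitives of Sections C.2.1 and C.2.2 can be written compactly in NumPy. The sketch below is a simplified illustration (it omits the small δ offset in softplus and uses our own function names), and it checks the logarithmic dot product against an ordinary matrix product.

```python
import numpy as np

def logsumexp(x, axis=-1):
    """Stable log(sum(exp(x))) along `axis`, as in Section C.2.1."""
    x_max = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(x_max, axis) + np.log(np.sum(np.exp(x - x_max), axis=axis))

def logsoftmax(x, axis=-1):
    return x - np.expand_dims(logsumexp(x, axis=axis), axis)

def logsigmoid(x):
    return -np.logaddexp(0.0, -x)   # equals -softplus(-x)

def log_matmul(logA, logB):
    """Logarithmic dot product of Section C.2.2:
    log(A @ B) computed from logA (d0 x d1) and logB (d1 x d2)."""
    return logsumexp(logA[:, :, None] + logB[None, :, :], axis=1)

rng = np.random.default_rng(0)
# logsoftmax yields log-weights that exponentiate to a point on the simplex.
w = np.exp(logsoftmax(rng.normal(size=5)))
assert np.isclose(w.sum(), 1.0)

# The log-domain product matches the ordinary product.
A, B = rng.uniform(0.1, 1.0, (3, 4)), rng.uniform(0.1, 1.0, (4, 2))
assert np.allclose(log_matmul(np.log(A), np.log(B)), np.log(A @ B))
```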

Figure 9. Factorized weight of DDSF. The v_{i,j} are learned parameters of the model; only the pseudo-parameters η and a are output by the conditioner. The activation function f is softmax, so adding η yields an element-wise rescaling of the inputs to this layer of the transformer by exp(η).

Figure 10. Visualization of how sigmoidal functions can universally approximate a monotonic function in [0, 1]. The red dotted curve is the target monotonic function (S), and the blue solid curve is the intermediate superposition of step functions (S_n*) with the parameters chosen in the proof. In this case, n is 6 and |S_n* − S| ≤ 1/7. The green dashed curve is the resulting superposition of sigmoids (S_n), which intercepts with each step of S_n* with the additionally chosen τ.

The linear transformation before applying the sigmoid resembles conditional weight normalization (CWN) (Krueger et al., 2017). While CWN rescales the weight vectors normalized to have unit L2 norm, here we rescale the weight vector normalized by softmax such that it sums to one and is positive. We call this conditional normalized weight exponentiation. This reduces the number of pseudo-parameters to O(Ld).

E. Identity Flow Initialization

In many cases, initializing the transformation to have a minimal effect is believed to help with training, as it can be thought of as a warm start with a simpler distribution. For instance, for variational inference, when we initialize the normalizing flow to be an identity flow, the approximate posterior is at least as good as the input distribution (usually a fully factorized Gaussian distribution) before the transformation. To this end, for DSF and DDSF, we initialize the pseudo-weights a to be close to 1 and the pseudo-biases b to be close to 0.

This is achieved by initializing the conditioner (whose outputs are the pseudo-parameters) to have small weights and the appropriate output biases. Specifically, we initialize the output biases of the last layer of our MADE (Germain et al., 2015) conditioner to be zero, and add softplus^{-1}(1) ≈ 0.5413 to the outputs which correspond to a before applying the softplus activation function. We initialize all the conditioner's weights by sampling from Unif(−0.001, 0.001). We note that there might be better ways to initialize the weights to account for the different numbers of active incoming units.

F. Lemmas: Uniform Convergence of DSF

We want to show the convergence result of Equation 11. To this end, we first show that DSF can be used to universally approximate any strictly monotonic function. This is the case where x_{1:t−1} are fixed, which means C(x_{1:t−1}) are simply constants. We demonstrate it using the following two lemmas.

Lemma 1. (Step functions universally approximate monotonic functions) Define:

    S_n*(x) = Σ_{j=1}^{n} w_j · s(x − b_j)

where s(z) is defined as a step function that is 1 when z ≥ 0 and 0 otherwise. For any continuous, strictly monotonically increasing S : [r_0, r_1] → [0, 1] where S(r_0) = 0, S(r_1) = 1 and r_0, r_1 ∈ R, and given any ε > 0, there exists a positive integer n, and real constants w_j and b_j for j = 1, ..., n, where Σ_j w_j = 1, w_j > 0 and b_j ∈ [r_0, r_1] for all j, such that |S_n*(x) − S(x)| < ε for all x ∈ [r_0, r_1].

Proof. (Lemma 1)

For brevity, we write s_j(x) = s(x − b_j). For any ε > 0, we choose n = ⌈1/ε⌉, and divide the range (0, 1) into n + 1 evenly spaced intervals: (0, y_1), (y_1, y_2), ..., (y_n, 1). For each y_j, there is a corresponding inverse value since S is strictly monotonic: x_j = S^{-1}(y_j).
Neural Autoregressive Flows

…constants $w_j$ and $b_j$ for $j = 1, \ldots, n$, where $\sum_{j=1}^{n} w_j = 1$, $w_j > 0$ and $b_j \in [r_0, r_1]$ for all $j$, such that $|S_n^*(x) - S(x)| < \epsilon$ for all $x \in [r_0, r_1]$.

Proof. (Lemma 1)

For brevity, we write $s_j(x) = s(x - b_j)$. For any $\epsilon > 0$, we choose $n = \lceil 1/\epsilon \rceil$ and divide the range $(0, 1)$ into $n + 1$ evenly spaced intervals: $(0, y_1), (y_1, y_2), \ldots, (y_n, 1)$. For each $y_j$ there is a corresponding inverse value, $x_j = S^{-1}(y_j)$, since $S$ is strictly monotonic. We want to set $S_n^*(x_j) = y_j$ for $1 \le j \le n - 1$ and $S_n^*(x_n) = 1$. To do so, we set the bias terms $b_j$ to be $x_j$. Then we just need to solve a system of $n$ linear equations $\sum_{j'=1}^{n} w_{j'} \cdot s_{j'}(x_j) = t_j$, where $t_j = y_j$ for $1 \le j < n$, $t_0 = 0$ and $t_n = 1$. We can express this system of equations in matrix form as $\mathbf{S} w = t$, where

$$\mathbf{S}_{j,j'} = s_{j'}(x_j) = \delta_{x_j \ge b_{j'}} = \delta_{j \ge j'}, \qquad \mathbf{w}_j = w_j, \qquad \mathbf{t}_j = t_j$$

and $\delta_{\omega \ge \eta} = 1$ whenever $\omega \ge \eta$ and $\delta_{\omega \ge \eta} = 0$ otherwise.

Then we have $w = \mathbf{S}^{-1} t$. Note that $\mathbf{S}$ is a lower triangular matrix, and its inverse takes the form of a Jordan matrix: $(\mathbf{S}^{-1})_{i,i} = 1$ and $(\mathbf{S}^{-1})_{i+1,i} = -1$. Additionally, $t_j - t_{j-1} = \frac{1}{n+1}$ for $j = 1, \ldots, n-1$, and is equal to $\frac{2}{n+1}$ for $j = n$. We then have $S_n^*(x) = t^{\top} \mathbf{S}^{-\top} s(x)$, where $s(x)_j = s_j(x)$; thus

$$
|S_n^*(x) - S(x)| = \Big| \sum_{j=1}^{n} s_j(x)\,(t_j - t_{j-1}) - S(x) \Big|
= \Big| \frac{1}{n+1} \sum_{j=1}^{n-1} s_j(x) + \frac{2\, s_n(x)}{n+1} - S(x) \Big|
= \Big| \frac{C_{y_{1:n-1}}(x)}{n+1} + \frac{2\,\delta_{x \ge y_n}}{n+1} - S(x) \Big|
< \frac{1}{n+1} \le \frac{1}{\lceil 1/\epsilon \rceil} \le \epsilon
\tag{21}
$$

where $C_v(z) = \sum_k \delta_{z \ge v_k}$ is the count of elements in a vector that $z$ is no smaller than.

Note that the additional constraint that $w$ lies on an $(n-1)$-dimensional simplex is always satisfied, because

$$\sum_j w_j = \sum_j (t_j - t_{j-1}) = t_n - t_0 = 1$$

See Figure 10 for a visual illustration of the proof. Using this result, we now turn to the case of using sigmoid functions instead of step functions.

Lemma 2. (Superimposed sigmoids universally approximate monotonic functions) Define:

$$S_n(x) = \sum_{j=1}^{n} w_j \cdot \sigma\!\left(\frac{x - b_j}{\tau_j}\right)$$

With the same constraints and definitions as in Lemma 1, given any $\epsilon > 0$, there exist a positive integer $n$ and real constants $w_j$, $\tau_j$ and $b_j$ for $j = 1, \ldots, n$, where additionally the $\tau_j$ are bounded and positive, such that $|S_n(x) - S(x)| < \epsilon$ for all $x \in (r_0, r_1)$.

Proof. (Lemma 2)

Let $\epsilon_1 = \frac{1}{3}\epsilon$ and $\epsilon_2 = \frac{2}{3}\epsilon$. We know that for this $\epsilon_1$, there exists an $n$ such that $|S_n^* - S| < \epsilon_1$.

We choose the same $w_j, b_j$ for $j = 1, \ldots, n$ as the ones used in the proof of Lemma 1, and let $\tau_1, \ldots, \tau_n$ all be the same value, denoted by $\tau$.

Take $\kappa = \min_{j \ne j'} |b_j - b_{j'}|$ and $\tau = \frac{\kappa}{\sigma^{-1}(1 - \epsilon_0)}$ for some $\epsilon_0 > 0$. Take $\Gamma$ to be a lower triangular matrix with values of 0.5 on the diagonal and 1 below the diagonal. Then

$$
\max_{j=1,\ldots,n} \big| S_n(b_j) - \Gamma_j \cdot w \big|
= \max_j \Big| \sum_{j'} w_{j'}\, \sigma\!\Big(\frac{b_j - b_{j'}}{\tau}\Big) - \sum_{j'} w_{j'}\, \Gamma_{jj'} \Big|
= \max_j \Big| \sum_{j'} w_{j'} \Big( \sigma\!\Big(\frac{b_j - b_{j'}}{\tau}\Big) - \Gamma_{jj'} \Big) \Big|
< \max_j \sum_{j'} w_{j'}\, \epsilon_0 = \epsilon_0
$$

The inequality is due to

$$
\sigma\!\left(\frac{b_j - b_{j'}}{\tau}\right) = \sigma\!\left(\frac{b_j - b_{j'}}{\min_{k \ne k'} |b_k - b_{k'}|}\, \sigma^{-1}(1 - \epsilon_0)\right)
\begin{cases}
= 0.5 & \text{if } j = j' \\
\ge 1 - \epsilon_0 & \text{if } j > j' \\
\le \epsilon_0 & \text{if } j < j'
\end{cases}
$$

Since the product $\Gamma \cdot w$ represents the half-step points of $S_n^*$ at the $x = b_j$'s, the result above entails $|S_n(x) - S_n^*(x)| < \epsilon_2 = 2\epsilon_1$ for all $x$. To see this, we choose $\epsilon_0 = \frac{1}{2(n+1)}$. Then $S_n$ intercepts with all segments of $S_n^*$ except for the ends. We choose $\epsilon_2 = 2\epsilon_1$ since the last step of $S_n^*$ is of size $\frac{2}{n+1}$, and thus the bound also holds true in the vicinity where $S_n$ intercepts with the last step of $S_n^*$.

Hence,

$$|S_n(x) - S(x)| \le |S_n(x) - S_n^*(x)| + |S_n^*(x) - S(x)| < \epsilon_1 + \epsilon_2 = \epsilon$$

Now we show that (the pre-logit) DSF (Equation 11) can universally approximate monotonic functions. We do this by showing that the well-known universal function approximation properties of neural networks (Cybenko, 1989) allow us to produce parameters which are sufficiently close to those required by Lemma 2.
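To make the constructions of Lemmas 1 and 2 concrete, the following NumPy sketch (our own illustration, not part of the paper) places the sigmoid biases at the quantiles $x_j = S^{-1}(y_j)$ of an example target $S$, uses the weights $w_j = t_j - t_{j-1}$ and a shared temperature $\tau = \kappa / \sigma^{-1}(1 - \epsilon_0)$, and checks numerically that the sup-norm error shrinks as $n$ grows:

    import numpy as np

    def sigmoid(z):
        # numerically stable logistic function
        return 0.5 * (1.0 + np.tanh(0.5 * z))

    # Example strictly monotonic target S: [r0, r1] -> [0, 1] (our own choice).
    r0, r1 = -3.0, 3.0
    lo, hi = np.tanh(r0), np.tanh(r1)
    S = lambda x: (np.tanh(x) - lo) / (hi - lo)
    S_inv = lambda y: np.arctanh(y * (hi - lo) + lo)

    x = np.linspace(r0, r1, 2001)
    for n in (4, 16, 64, 256):
        y = np.arange(1, n + 1) / (n + 1)          # y_1, ..., y_n
        b = S_inv(y)                                # b_j = x_j = S^{-1}(y_j)
        t = np.concatenate([y[:-1], [1.0]])         # t_j = y_j for j < n, t_n = 1
        w = np.diff(np.concatenate([[0.0], t]))     # w_j = t_j - t_{j-1}, sums to 1
        kappa = np.min(np.diff(b))                  # min gap between biases
        eps0 = 1.0 / (2 * (n + 1))
        tau = kappa / np.log((1.0 - eps0) / eps0)   # sigma^{-1}(1 - eps0) = log((1-eps0)/eps0)
        Sn = (w * sigmoid((x[:, None] - b) / tau)).sum(axis=1)
        print(n, np.max(np.abs(Sn - S(x))))         # sup error shrinks as n grows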
Lemma 3. Let $x_{1:m} \in [r_0, r_1]^m$ where $r_0, r_1 \in \mathbb{R}$. Given any $\epsilon > 0$ and any multivariate continuously differentiable function$^{11}$ $S(x_{1:m})_t = S_t(x_t, x_{1:t-1})$ for $t \in [1, m]$ that is strictly monotonic with respect to the first argument when the second argument is fixed, where the boundary values are $S_t(r_0, x_{1:t-1}) = 0$ and $S_t(r_1, x_{1:t-1}) = 1$ for all $x_{1:t-1}$ and $t$, then there exists a multivariate function $\bar{S}$ such that $\|\bar{S}(x_{1:m}) - S(x_{1:m})\|_\infty < \epsilon$ for all $x_{1:m}$, of the following form:

$$\bar{S}(x_{1:m})_t = \bar{S}_t(x_t, C_t(x_{1:t-1})) = \sum_{j=1}^{n} w_{tj}(x_{1:t-1}) \cdot \sigma\!\left(\frac{x_t - b_{tj}(x_{1:t-1})}{\tau_{tj}(x_{1:t-1})}\right)$$

where $t \in [1, m]$, and $C_t = (w_{tj}, b_{tj}, \tau_{tj})_{j=1}^{n}$ are functions of $x_{1:t-1}$ parameterized by neural networks, with $\tau_{tj}$ bounded and positive, $b_{tj} \in [r_0, r_1]$, $\sum_{j=1}^{n} w_{tj} = 1$, and $w_{tj} > 0$.

$^{11}$ $S(\cdot): [r_0, r_1]^m \to [0, 1]^m$ is a multivariate-multivariable function, where $S(\cdot)_t$ is its $t$-th component, which is a univariate-multivariable function, written as $S_t(\cdot, \cdot): [r_0, r_1] \times [r_0, r_1]^{t-1} \to [0, 1]$.

Proof. (Lemma 3)

First we deal with the univariate case (for any $t$) and drop the subscript $t$ of the functions. We write $\bar{S}^n$ and $C^k$ to denote the sequences of univariate functions. We want to show that for any $\epsilon > 0$, there exist (1) a sequence of functions $\bar{S}^n(x_t, C^k(x_{1:t-1}))$ in the given form, and (2) large enough $N$ and $K$ such that when $n \ge N$ and $k \ge K$, $|\bar{S}^n(x_t, C^k(x_{1:t-1})) - S(x_t, x_{1:t-1})| \le \epsilon$ for all $x_{1:t} \in [r_0, r_1]^t$.

The idea is first to show that we can find a sequence of parameters, $C^n(x_{1:t-1})$, that yield a good approximation of the target function, $S(x_t, x_{1:t-1})$. We then show that these parameters can be arbitrarily well approximated by the outputs of a neural network, $C^k(x_{1:t-1})$, which in turn yield a good approximation of $S$.

From Lemma 2, we know that such a sequence $C^n(x_{1:t-1})$ exists, and furthermore that we can, for any $\epsilon$, and independently of $S$ and $x_{1:t-1}$, choose an $N$ large enough so that

$$|\bar{S}^n(x_t, C^n(x_{1:t-1})) - S(x_t, x_{1:t-1})| < \frac{\epsilon}{2} \tag{22}$$

To see that we can further approximate a given $C^n(x_{1:t-1})$ well by $C^k(x_{1:t-1})$,$^{12}$ we apply the classic result of Cybenko (1989), which states that a multilayer perceptron can approximate any continuous function on a compact subset of $\mathbb{R}^m$. Note that the specification of $C^n(x_{1:t-1}) = (w_{tj}, b_{tj}, \tau_{tj})_{j=1}^{n}$ in Lemma 2 depends on the quantiles of $S_t(x_t, \cdot)$ as a function of $x_{1:t-1}$; since the quantiles are continuous functions of $x_{1:t-1}$, so is $C^n(x_{1:t-1})$, and the theorem applies.

$^{12}$ Note that $C^n$ is a chosen function (we are not assuming its parameterization; i.e., it is not necessarily a neural network) that we seek to approximate using $C^k$, which is the output of a neural network.

Now, $\bar{S}$ has a bounded derivative wrt $C$, and is thus uniformly continuous, so long as $\tau$ is greater than some positive constant, which is always the case for any fixed $C^n$, and thus can be guaranteed for $C^k$ as well (for large enough $k$). Uniform continuity allows us to translate the convergence of $C^k \to C^n$ into convergence of $\bar{S}^n(x_t, C^k(x_{1:t-1})) \to \bar{S}^n(x_t, C^n(x_{1:t-1}))$, since for any $\epsilon$, there exists a $\delta > 0$ such that

$$\|C^k(x_{1:t-1}) - C^n(x_{1:t-1})\|_\infty < \delta \;\implies\; |\bar{S}^n(x_t, C^k(x_{1:t-1})) - \bar{S}^n(x_t, C^n(x_{1:t-1}))| < \frac{\epsilon}{2} \tag{23}$$

Combining this with Equation 22, we have, for all $x_t$ and $x_{1:t-1}$, and for all $n \ge N$ and $k \ge K$,

$$
|\bar{S}^n(x_t, C^k(x_{1:t-1})) - S(x_t, x_{1:t-1})|
\le |\bar{S}^n(x_t, C^k(x_{1:t-1})) - \bar{S}^n(x_t, C^n(x_{1:t-1}))| + |\bar{S}^n(x_t, C^n(x_{1:t-1})) - S(x_t, x_{1:t-1})|
< \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon
$$

Having proved the univariate case, we add back the subscript $t$ to denote the index of the function. From the above, we know that given any $\epsilon > 0$, for each $t$ there exist $N(t)$ and a sequence of univariate functions $\bar{S}^{n,t}$ such that for all $n \ge N(t)$, $|\bar{S}^{n,t} - S_t| < \epsilon$ for all $x_{1:t}$. Choosing $N = \max_{t \in [1,m]} N(t)$, we have that there exists a sequence of multivariate functions $\bar{S}^n$ in the given form such that for all $n \ge N$, $\|\bar{S}^n - S\|_\infty < \epsilon$ for all $x_{1:m}$.

G. Proof of Universal Approximation of DSF

Lemma 4. Let $X \in \mathcal{X}$ be a random variable, where $\mathcal{X} \subseteq \mathbb{R}^m$ and $\mathcal{Y} \subseteq \mathbb{R}^m$. Given any function $J: \mathcal{X} \to \mathcal{Y}$ and a sequence of functions $J_n$ that converges pointwise to $J$, the sequence of random variables induced by the transformations $Y_n := J_n(X)$ converges in distribution to $Y := J(X)$.

Proof. (Lemma 4)

Let $h$ be any bounded, continuous function on $\mathbb{R}^m$, so that $h \circ J_n$ converges pointwise to $h \circ J$ by continuity of $h$. Since $h$ is bounded, by the dominated convergence theorem, $\mathbb{E}[h(Y_n)] = \mathbb{E}[h(J_n(X))]$ converges to $\mathbb{E}[h(J(X))] = \mathbb{E}[h(Y)]$. As this result holds for any bounded continuous function $h$, by the Portmanteau lemma, we have $Y_n \xrightarrow{d} Y$.
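Before turning to the proofs of the propositions, it may help to see the "given form" of Lemma 3 written out in code. The sketch below is our own illustration (not the paper's released code); it follows the $(x_t - b)/\tau$ parameterization used in Lemma 3 rather than the exact parameterization of Equation 11, and the variable names are assumptions:

    import torch
    import torch.nn.functional as F

    def sigmoidal_flow_step(x_t, pre_w, b, pre_tau):
        # Illustrative sketch of S_t(x_t, C_t(x_{1:t-1})) from Lemma 3.
        # x_t:     (batch,)    the t-th input dimension
        # pre_w:   (batch, n)  pre-softmax weights  -> w_tj > 0, sum_j w_tj = 1
        # b:       (batch, n)  biases b_tj
        # pre_tau: (batch, n)  pre-softplus temperatures -> tau_tj > 0
        # All three would be produced by a conditioner c(x_{1:t-1}), e.g. a MADE.
        w = torch.softmax(pre_w, dim=-1)
        tau = F.softplus(pre_tau)
        # sum_j w_tj * sigmoid((x_t - b_tj) / tau_tj)
        return (w * torch.sigmoid((x_t.unsqueeze(-1) - b) / tau)).sum(dim=-1)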
Proof. (Proposition 2)

Given an arbitrary ordering, let $F$ be the CDFs of $Y$, defined as $F_t(y_t, x_{1:t-1}) = \Pr(Y_t \le y_t \mid x_{1:t-1})$. According to Theorem 1 of Hyvärinen & Pajunen (1999), $F(Y)$ is uniformly distributed in the cube $[0, 1]^m$. $F$ has an upper triangular Jacobian matrix, whose diagonal entries are conditional densities, which are positive by assumption. Let $G$ be a multivariate and multivariable function where $G_t$ is the inverse of the CDF of $Y_t$: $G_t(F_t(y_t, x_{1:t-1}), x_{1:t-1}) = y_t$.

According to Lemma 3, there exists a sequence of functions in the given form $(\bar{S}^n)_{n \ge 1}$ that converge uniformly to $\sigma \circ G$. Since uniform convergence implies pointwise convergence, $G_n = \sigma^{-1} \circ \bar{S}^n$ converges pointwise to $G$, by continuity of $\sigma^{-1}$. Since $G_n$ converges pointwise to $G$ and $G(X) = Y$, by Lemma 4, we have $Y_n \xrightarrow{d} Y$.

Proof. (Proposition 3)

Given an arbitrary ordering, let $H$ be the CDFs of $X$:

$$y_1 = H_1(x_1, \varnothing) := F_1(x_1, \varnothing) = \Pr(X_1 \le x_1 \mid \varnothing)$$
$$y_t = H_t(x_t, x_{1:t-1}) := F_t\!\left(x_t, \{H_{t-t'}(x_{t-t'}, x_{1:t-t'-1})\}_{t'=1}^{t-1}\right) = \Pr(X_t \le x_t \mid y_{1:t-1}) \quad \text{for } 2 \le t \le m$$

Due to Hyvärinen & Pajunen (1999), $y_1, \ldots, y_m$ are independently and uniformly distributed in $(0, 1)^m$.

According to Lemma 3, there exists a sequence of functions in the given form $(\bar{S}^n)_{n \ge 1}$ that converge uniformly to $H$. Since $H_n = \bar{S}^n$ converges pointwise to $H$ and $H(X) = Y$, by Lemma 4, we have $Y_n \xrightarrow{d} Y$.

Proof. (Theorem 1)

Given an arbitrary ordering, let $H$ be the CDFs of $X$, defined the same way as in the proof for Proposition 3, and let $G$ be the inverse of the CDFs of $Y$, defined the same way as in the proof for Proposition 2. Due to Hyvärinen & Pajunen (1999), $H(X)$ is uniformly distributed in $(0, 1)^m$, so $G(H(X)) = Y$. Since $H_t(x_t, x_{1:t-1})$ is monotonic wrt $x_t$ given $x_{1:t-1}$, and $G_t(H_t, H_{1:t-1})$ is monotonic wrt $H_t$ given $H_{1:t-1}$, $G_t$ is also monotonic wrt $x_t$ given $x_{1:t-1}$, as

$$\frac{\partial G_t(H_t, H_{1:t-1})}{\partial x_t} = \frac{\partial G_t(H_t, H_{1:t-1})}{\partial H_t} \cdot \frac{\partial H_t(x_t, x_{1:t-1})}{\partial x_t}$$

is always positive.

According to Lemma 3, there exists a sequence of functions in the given form $(\bar{S}^n)_{n \ge 1}$ that converge uniformly to $\sigma \circ G \circ H$. Since uniform convergence implies pointwise convergence, $K_n = \sigma^{-1} \circ \bar{S}^n$ converges pointwise to $G \circ H$, by continuity of $\sigma^{-1}$. Since $K_n$ converges pointwise to $G \circ H$ and $G(H(X)) = Y$, by Lemma 4, we have $Y_n \xrightarrow{d} Y$.

H. Experimental Details

For the experiment on amortized variational inference, we implement the Variational Autoencoder (Kingma & Welling, 2013). Specifically, we follow the architecture used in Kingma et al. (2016): the encoder has three layers with [16, 32, 32] feature maps. We use resnet blocks (He et al., 2016) with 3 × 3 convolution filters and a stride of 2 to downsize the feature maps. The convolution layers are followed by a fully connected layer of size 450 that serves as a context for the flow layers, which transform the noise sampled from a standard normal distribution of dimension 32. The decoder is symmetrical with the encoder, with the strided convolution replaced by a combination of bilinear upsampling and regular resnet convolution to double the feature map size. We used the ELU activation function (Clevert et al., 2015) and weight normalization (Salimans & Kingma, 2016) in the encoder and decoder. For optimization, Adam (Kingma et al., 2015) is used with the learning rate fine-tuned for each inference setting, and Polyak averaging (Polyak & Juditsky, 1992) is used for evaluation with α = 0.998, which stands for the proportion of the past retained at each time step. We also consider a variant of Adam known as Amsgrad (Reddi et al., 2018) as a hyperparameter. For the vanilla VAE, we simply apply a resnet dot product with the context vector to output the mean and the pre-softplus standard deviation, and transform each dimension of the noise vector independently; we call this linear flow. For IAF-affine and IAF-DSF, we employ MADE (Germain et al., 2015) as the conditioner c(x_{1:t-1}), and we apply a dot product on the context vector to output a scale vector and a bias vector that conditionally rescale and shift the preactivation of each layer of the MADE. Each MADE has one hidden layer with 1920 hidden units. The IAF experiments all start with a linear flow layer, followed by IAF-affine or IAF-DSF transformations. For DSF, we choose d = 16.

For the experiment on density estimation with MAF, we followed the implementation of Papamakarios et al. (2017). Specifically, for each dataset we experimented with both 5 and 10 flow layers, followed by one linear flow layer. The following table specifies the number of hidden layers and the number of hidden units per hidden layer for the MADE used with each dataset:

Table 3. Architecture specification of MADE in the MAF experiment (number of hidden layers × number of hidden units per layer).
  POWER:      2 × 100
  GAS:        2 × 100
  HEPMASS:    2 × 512
  MINIBOONE:  1 × 512
  BSDS300:    2 × 1024
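For concreteness, the MAF architecture choices described above can be summarized in a small configuration sketch (our own illustration; the helper and key names are hypothetical, while the values come from Table 3 and the text):

    # Dataset -> (hidden layers, hidden units per layer) for each MADE (Table 3).
    MADE_ARCH = {
        "POWER":     (2, 100),
        "GAS":       (2, 100),
        "HEPMASS":   (2, 512),
        "MINIBOONE": (1, 512),
        "BSDS300":   (2, 1024),
    }

    # Each model stacks 5 or 10 flow layers, followed by one linear flow layer.
    FLOW_DEPTHS = (5, 10)

    def build_config(dataset, n_flows):
        n_layers, n_units = MADE_ARCH[dataset]
        return {
            "dataset": dataset,
            "n_flows": n_flows,
            "made_hidden_layers": n_layers,
            "made_hidden_units": n_units,
            "final_linear_flow": True,
        }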
