Neural Autoregressive Flows
Abstract

... state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF) (Papamakarios et al., 2017), and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time (Oord et al., 2017), via Inverse Autoregressive Flows (IAF) (Kingma et al., 2016). We unify and generalize these approaches, replacing the (conditionally) affine univariate transformations of MAF/IAF with a more general class of invertible univariate transformations expressed as monotonic neural networks. We demonstrate that the proposed neural autoregressive flows (NAF) are universal approximators for continuous probability distributions, and their greater expressivity allows them to better capture multimodal target distributions. Experimentally, NAF yields state-of-the-art performance on a suite of density estimation tasks and outperforms IAF in variational autoencoders trained on binarized MNIST.¹

1. Introduction

Invertible transformations with a tractable Jacobian, also known as normalizing flows, are useful tools in many machine learning problems, for example: (1) In the context of deep generative models, training necessitates evaluating data samples under the model's inverse transformation (Dinh et al., 2016). Tractable density is an appealing property for these models, since it allows the objective of interest to be directly optimized, whereas other mainstream methods rely on alternative losses in the case of intractable density models (Kingma & Welling, 2013; Rezende et al., 2014). (2) In the context of variational inference (Rezende & Mohamed, 2015), normalizing flows can improve the variational approximation to the posterior by parametrizing more complex distributions. This is important since a poor variational approximation to the posterior can fail to reflect the right amount of uncertainty, and/or be biased (Turner & Sahani, 2011), resulting in inaccurate and unreliable predictions. We are thus interested in improving techniques for normalizing flows.

Recent work by Kingma et al. (2016) reinterprets autoregressive models as invertible transformations suitable for constructing normalizing flows. The inverse transformation process, unlike sampling from the autoregressive model, is not sequential and thus can be accelerated via parallel computation. This allows multiple layers of transformations to be stacked, increasing expressiveness for better variational inference (Kingma et al., 2016) or better density estimation for generative models (Papamakarios et al., 2017). Stacking also makes it possible to improve on the sequential conditional factorization assumed by autoregressive models such as PixelRNN or PixelCNN (Oord et al., 2016), and thus define a more flexible joint probability.

We note that the normalizing flow introduced by Kingma et al. (2016) only applies an affine transformation to each scalar random variable. Although this transformation is conditioned on preceding variables, the resulting flow can still be susceptible to bad local minima, and thus fail to capture the multimodal shape of a target density; see Figures 1 and 2.

1.1. Contributions of this work

We propose replacing the conditional affine transformation of Kingma et al. (2016) with a richer family of transformations, and note the requirements for doing so. We determine that very general transformations, for instance parametrized by deep neural networks, are possible. We then propose and evaluate several specific monotonic neural network architectures which are more suited for learning multimodal distributions. Concretely, our method amounts to using an autoregressive model to output the weights of multiple independent transformer networks, each of which operates on a single random variable, replacing the affine transformations used in MAF/IAF.

* Equal contribution. ¹MILA, University of Montreal, ²Element AI, ³CIFAR fellow. Correspondence to: Chin-Wei Huang <chin-[Link]@[Link]>. Preliminary work. Under review by the International Conference on Machine Learning (ICML). Do not distribute.

¹ Implementation can be found at [Link]Huang/NAF/
Empirically, we show that our method works better than the state-of-the-art affine autoregressive flows of Kingma et al. (2016) and Papamakarios et al. (2017), both as a sample generator which captures multimodal target densities with higher fidelity, and as a density model which more accurately evaluates the likelihood of data samples drawn from an unknown distribution.

We also demonstrate that our method is a universal approximator on proper distributions in real space, which guarantees the expressiveness of the chosen parameterization and supports our empirical findings.

2. Background

A (finite) normalizing flow (NF), or flow, is an invertible function f_θ : X → Y used to express a transformation between random variables.² Since f is invertible, the change of variables formula can be used to translate between the densities p_Y(y) and p_X(x):

    p_Y(y) = p_X(x) \left| \frac{\partial f(x)}{\partial x} \right|^{-1}        (1)

The determinant of f's Jacobian appears on the right hand side to account for the way in which f can (locally) expand or contract regions of X, thereby lowering or raising the resulting density in those regions' images in Y.

In a typical application of variational inference (Rezende & Mohamed, 2015; Kingma et al., 2016), p_target(y) is a complex posterior over latent variables y, and f transforms a simple input distribution (for instance a standard normal distribution) over x into a complex approximate posterior p_Y(y) (unstructured to structured). In either case, since p_X does not depend on θ, the gradients of the KL-divergence are typically estimated by Monte Carlo:

    ∇_θ D_KL( p_Y(y) || p_target(y) )
        = ∇_θ \int_Y p_Y(y) \log \frac{p_Y(y)}{p_target(y)} \, dy
        = \int_X p_X(x) ∇_θ \log \frac{p_Y(y)}{p_target(y)} \, dx        (2)

Applying the change of variables formula from Equation 1 to the right hand side of Equation 2 yields:

    E_{x ∼ p_X(x)} \left[ ∇_θ \log \left( p_X(x) \left| \frac{\partial f_θ(x)}{\partial x} \right|^{-1} \right) − ∇_θ \log p_target(y) \Big|_{y = f_θ(x)} \right]        (3)

² We use x and y to denote inputs and outputs of a function, not the inputs and targets of a supervised learning problem.

³ It may also be possible to form a generative model from such a flow, by passing samples from the prior p_target(y) through f^{-1}, although the cost of doing so may vary. For example, RealNVP (Dinh et al., 2016) was devised as a generative model, and its inverse computation is as cheap as its forward computation, whereas MAF (Papamakarios et al., 2017) is designed for density estimation and is much more expensive to sample from. For the NAF architectures we employ, we do not have an analytic expression for f^{-1}, but it is possible to approximate it numerically.
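As a concrete illustration of Equation 1, the following NumPy sketch (our own, not the paper's released code) checks the change-of-variables formula in one dimension against a Monte Carlo histogram; the transformer f, its derivative, and all names are illustrative choices.

```python
# A minimal sketch of the change-of-variables formula (Equation 1) in 1-D.
import numpy as np

def f(x):
    """An invertible, monotonic transformer: a fixed affine-plus-tanh warp."""
    return x + 0.5 * np.tanh(x)

def df_dx(x):
    """Analytic derivative of f, i.e. the 1x1 Jacobian."""
    return 1.0 + 0.5 * (1.0 - np.tanh(x) ** 2)

def p_x(x):
    """Base density p_X: a standard normal."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

# Density of y = f(x) via Equation 1: p_Y(y) = p_X(x) |df/dx|^{-1}, evaluated at y = f(x).
x = np.linspace(-4.0, 4.0, 1001)
p_y_analytic = p_x(x) / np.abs(df_dx(x))

# Monte Carlo check: a histogram of samples y = f(x), x ~ p_X, should track the analytic curve.
samples = f(np.random.default_rng(0).standard_normal(200_000))
hist, edges = np.histogram(samples, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(np.interp(centers, f(x), p_y_analytic) - hist)))  # small discrepancy
```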
Thus for efficient training, the following operations must be tractable and cheap:

1. Sampling x ∼ p_X(x)
2. Computing y = f(x)
3. Computing the gradient of the log-likelihood of y = f(x), x ∼ p_X(x), under both p_Y(y) and p_target(y)
4. Computing the gradient of the log-determinant of the Jacobian of f

Research on constructing NFs, such as our work, focuses on finding ways to parametrize flows which meet the above requirements while being maximally flexible in terms of the transformations which they can represent. Note that some of the terms of Equation 3 may be constant with respect to θ⁴ and thus trivial to differentiate, such as p_X(x) in the maximum likelihood setting.

Affine autoregressive flows (AAFs)⁵, such as inverse autoregressive flows (IAF) (Kingma et al., 2016), are one such class of NFs. The affine transformer used in these flows takes the form

    τ(µ, σ, x_t) = σ x_t + (1 − σ) µ        (6)

with σ produced by a sigmoid nonlinearity. Such transformers are trivially invertible, but their relative simplicity also means that the expressivity of f comes entirely from the complexity of c and from stacking multiple AAFs (potentially using different orderings of the variables)⁷. However, the only requirements on τ are:

1. The transformer τ must be invertible as a function of x_t.
2. dy_t/dx_t must be cheap to compute.

This raises the possibility of using a more powerful transformer in order to increase the expressivity of the flow.

3. Neural Autoregressive Flows

We propose replacing the affine transformer used in previous works with a neural network, yielding a richer family of distributions with only a minor increase in computation and memory requirements. Specifically,

    τ(c(x_{1:t−1}), x_t) = DNN(x_t; φ = c(x_{1:t−1}))        (7)

⁴ There might be some parameters other than θ that are learnable, such as parameters of p_X and p_target in the variational inference and maximum likelihood settings, respectively.

⁵ Our terminology differs from previous works, and hence holds the potential for confusion, but we believe it is apt. Under our unifying perspective, NAF, IAF, AF, and MAF all make use of the same principle, which is an invertible transformer conditioned on the outputs of an autoregressive (and emphatically not an inverse autoregressive) conditioner.

⁶ Dinh et al. (2014) use m and g^{-1} to denote c and τ, and refer to them as the "coupling function" and "coupling law", respectively.

⁷ Permuting the order of variables is itself a normalizing flow that does not expand or contract space and can be inverted by another permutation.
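To make the conditioner/transformer split and the log-determinant requirement concrete, here is a minimal NumPy sketch of one affine autoregressive step (Equation 6). The tiny conditioner below is a stand-in of our own, not the masked networks (MADE-style) used by IAF/MAF in practice; all names and shapes are assumptions for illustration.

```python
# Sketch of an affine autoregressive flow step: pseudo-parameters for dimension t
# depend only on x_{1:t-1}, and the log|det Jacobian| is a sum of per-dimension terms.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditioner(x_prefix, params):
    """Return (mu, pre_sigma) for one dimension from x_{1:t-1} (possibly empty)."""
    v, c_mu, c_sig = params
    s = float(v[:len(x_prefix)] @ x_prefix) if len(x_prefix) else 0.0
    return c_mu + s, c_sig + 0.5 * s

def affine_autoregressive_flow(x, params):
    """y_t = sigma_t * x_t + (1 - sigma_t) * mu_t (Equation 6), plus log|det Jacobian|."""
    y = np.empty_like(x)
    logdet = 0.0
    for t in range(len(x)):
        mu, pre_sigma = conditioner(x[:t], params)
        gate = sigmoid(pre_sigma)
        y[t] = gate * x[t] + (1.0 - gate) * mu
        logdet += np.log(gate)          # dy_t/dx_t = sigma_t in the affine case
    return y, logdet

rng = np.random.default_rng(0)
params = (rng.normal(size=8), 0.1, -0.2)   # toy conditioner weights
x = rng.standard_normal(5)
y, logdet = affine_autoregressive_flow(x, params)
print(y, logdet)
```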
[Figure: three-panel plots of p(x), y = τ_c(x), and p(y), for IAF (top row) and NAF (bottom row); axis tick values omitted.]

(a) Neural autoregressive flows (NAF)

Figure 5. Illustration of the effects of traditional IAF (top), and our proposed NAF (bottom). Areas where the slope of the transformer τ_c is greater/less than 1 are compressed/expanded (respectively) in the output distribution. Inflection points in τ_c(x_t) (middle) can transform a unimodal p(x_t) (left) into a multimodal p(y_t) (right); NAF allows for such inflection points, whereas IAF does not.
... softmax-weighted sums:

    y_t = σ^{-1}( w^T · σ( a · x_t + b ) )        (8)

where the dimensions are w^T : 1 × d, a : d × 1, x_t : 1 × 1, b : d × 1; 0 < w_{i,j} < 1, Σ_i w_{i,j} = 1, a_{s,t} > 0; and d denotes the number of hidden units.

Since all of the sigmoid activations are bounded between 0 and 1, the final preactivation (which is their convex combination) is as well. The complete DSF transformation can be seen as mapping the original random variable to a different space through an activation function, where doing affine/linear operations is non-linear with respect to the variable in the original space, and then mapping it back to the original space through the inverse activation.

When stacking multiple sigmoidal transformations, we realize that the result resembles an MLP with a bottleneck, as shown in the bottom left of Figure 4. A more general alternative is the deep dense sigmoidal flow (DDSF), which takes the form of a fully connected MLP; the layer-wise form is given in the appendix.

... matrix-matrix products for the forward and backwards propagation through the graph of τ_c. In particular, we use a technique similar to conditional weight normalization (CWN) (Krueger et al., 2017) in our experiments with DDSF; see appendix for details.

3.3. Possibilities for Alternative Architectures

While the DSF and DDSF architectures performed well in our experiments, there are many alternatives to be explored. One possibility is using other (strictly) monotonic activation functions in τ_c, such as leaky ReLUs (LReLU) (Xu et al., 2015) or ELUs (Clevert et al., 2015). Leaky ReLUs in particular are bijections on R and so would not require the softmax-weighted summation and activation function inversion tricks discussed in the previous section.

Finally, we emphasize that in general, τ need not be expressed as a neural architecture; it only needs to satisfy the requirements of invertibility and differentiability given at the end of Section 2.
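The DSF transformer of Equation 8 can be rendered directly in a few lines; the sketch below is ours (pseudo-parameter shapes for a single dimension x_t are assumptions), enforcing the stated constraints with a softmax over w and a softplus on the slopes.

```python
# A direct NumPy rendering of the DSF transformer (Equation 8) with its constraints.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p) - np.log1p(-p)

def dsf_transformer(x_t, w_logits, a_pre, b):
    """y_t = sigmoid^{-1}( w^T . sigmoid(a * x_t + b) ).

    w = softmax(w_logits), so 0 < w_j < 1 and sum_j w_j = 1;
    a = softplus(a_pre) > 0 keeps each sigmoid argument increasing in x_t.
    """
    w = np.exp(w_logits - np.max(w_logits))
    w = w / w.sum()
    a = np.log1p(np.exp(a_pre))                 # softplus
    pre = w @ sigmoid(a * x_t + b)              # convex combination, stays in (0, 1)
    return logit(pre)

# dy_t/dx_t (needed for the log-determinant) is strictly positive, so the
# transformer is invertible; a finite-difference check:
rng = np.random.default_rng(0)
d = 16
params = (rng.normal(size=d), rng.normal(size=d), rng.normal(size=d))
x, eps = 0.7, 1e-5
print((dsf_transformer(x + eps, *params) - dsf_transformer(x - eps, *params)) / (2 * eps))
```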
(3) Combining the two, we further show that DSF can transform any structured noise variable into a random variable with any desired distribution.

We define the following notation for the pre-logit of the DSF transformation (compare Equation 8):

    S(x_t, C(x_{1:t−1})) = Σ_{j=1}^{n} w_j(x_{1:t−1}) · σ( (x_t − b_j(x_{1:t−1})) / τ_j(x_{1:t−1}) )        (11)

where C = (w_j, b_j, τ_j)_{j=1}^{n} are functions of x_{1:t−1} parameterized by neural networks. Let b_j be in (r_0, r_1); τ_j be bounded and positive; Σ_{j=1}^{n} w_j = 1 and w_j > 0. See Appendix F and G for the proofs.

Proposition 2. (DSF universally transforms uniform random variables into any desired random variables) Let Y be a random vector in R^m and assume Y has a strictly positive and continuous probability density. Let X ∼ Unif((0, 1)^m). Then there exists a sequence of functions (G_n)_{n≥1} parameterized by autoregressive neural networks of the following form

    G(x)_t = σ^{-1}( S(x_t; C_t(x_{1:t−1})) )        (12)

where C_t = (a_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} are functions of x_{1:t−1}, such that Y_n := G_n(X) converges in distribution to Y.

Proposition 3. (DSF universally transforms any random variables into uniformly distributed random variables) Let X be a random vector in an open set U ⊂ R^m. Assume X has a positive and continuous probability density. Let Y ∼ Unif((0, 1)^m). Then there exists a sequence of functions (H_n)_{n≥1} parameterized by autoregressive neural networks of the following form

    H(x)_t = S(x_t; C_t(x_{1:t−1}))        (13)

where C_t = (a_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} are functions of x_{1:t−1}, such that Y_n := H_n(X) converges in distribution to Y.

Theorem 1. (DSF universally transforms any random variables into any desired random variables) Let X be a random vector in an open set U ⊂ R^m. Let Y be a random vector in R^m. Assume both X and Y have a positive and continuous probability density. Then there exists a sequence of functions (K_n)_{n≥1} parameterized by autoregressive neural networks of the following form

    K(x)_t = σ^{-1}( S(x_t; C_t(x_{1:t−1})) )        (14)

where C_t = (a_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} are functions of x_{1:t−1}, such that Y_n := K_n(X) converges in distribution to Y.

5. Related work

Neural autoregressive flows are a generalization of the affine autoregressive flows introduced by Kingma et al. (2016) as inverse autoregressive flows (IAF) and further developed by Chen et al. (2016) and Papamakarios et al. (2017) as autoregressive flows (AF) and masked autoregressive flows (MAF), respectively; for details on their relationship to our work see Sections 2 and 3. While Dinh et al. (2014) draw a particular connection between their NICE model and the Neural Autoregressive Density Estimator (NADE) (Larochelle & Murray, 2011), Kingma et al. (2016) were the first to highlight the general approach of using autoregressive models to construct normalizing flows. Chen et al. (2016) and then Papamakarios et al. (2017) subsequently noticed that this same approach could be used efficiently in reverse when the key operation is evaluating, as opposed to sampling from, the flow's learned output density. Our method increases the expressivity of these previous approaches by using a neural net to output pseudo-parameters of another network, thus falling into the hypernetwork framework (Ha et al., 2016; Bertinetto et al., 2016; Brabandere et al., 2016).

There has been a growing interest in normalizing flows (NFs) in the deep learning community, driven by successful applications and the structural advantages they have over alternatives. Rippel & Adams (2013), Rezende & Mohamed (2015) and Dinh et al. (2014) first introduced normalizing flows to the deep learning community as density models, variational posteriors and generative models, respectively. In contrast to traditional variational posteriors, NFs can represent a richer family of distributions without requiring approximations (beyond Monte Carlo estimation of the KL-divergence). The NF-based RealNVP-style generative models (Dinh et al., 2016; 2014) also have qualitative advantages over alternative approaches. Unlike generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), computing likelihood is cheap. Unlike autoregressive generative models, such as PixelCNNs (Oord et al., 2016), sampling is also cheap. Unfortunately, in practice RealNVP-style models are not currently competitive with autoregressive models in terms of likelihood, perhaps due to the more restricted nature of the transformations they employ.

Several promising recent works expand the capabilities of NFs for generative modeling and density estimation, however. Perhaps the most exciting example is Oord et al. (2017), who propose the probability density distillation technique to train an IAF (Kingma et al., 2016) based on the autoregressive WaveNet (van den Oord et al., 2016) as a generative model, using another pretrained WaveNet model to express the target density, thus overcoming the slow sequential sampling procedure required by the original WaveNet (and characteristic of autoregressive models in general), and reaching super-real-time speeds suitable for production.
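The constructive idea behind Equation 11 and the universality results above (biases placed at quantiles of the target, equal softmax weights, a small temperature) can be checked numerically. The following NumPy sketch is purely illustrative; the target mixture, grid, and parameter choices are our own assumptions, not values from the paper.

```python
# A superposition of sigmoids with biases at the target's quantiles approximates its CDF,
# so sigma^{-1}(S(x)) would push a uniform/base variable onto the target (cf. Equation 12).
import numpy as np
from math import erf, sqrt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def target_cdf(x):
    """CDF of an equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    return 0.5 * (0.5 * (1.0 + erf((x + 2.0) / (0.5 * sqrt(2.0)))) +
                  0.5 * (1.0 + erf((x - 2.0) / (0.5 * sqrt(2.0)))))

n = 50
levels = (np.arange(n) + 0.5) / n
grid = np.linspace(-6.0, 6.0, 20001)
cdf_grid = np.array([target_cdf(g) for g in grid])
b = np.interp(levels, cdf_grid, grid)     # biases at the target's quantiles
w = np.full(n, 1.0 / n)                   # equal weights summing to one
tau = 0.05                                # a small common temperature

def S(x):
    return w @ sigmoid((x - b) / tau)

xs = np.linspace(-5.0, 5.0, 101)
err = max(abs(S(x) - target_cdf(x)) for x in xs)
print(err)    # shrinks as n grows and tau shrinks
```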
[Figure residue: plots with legends KL(p‖q) and KL(q‖p), and curves y(t) for f = 0.6, 1.2, 1.8 over t, f ∈ [0, 2]; caption fragment: "... average and standard deviation are carried out with 5 trials of ..."]
The previously mentioned MAF technique (Papamakarios et al., 2017) further demonstrates the potential of NFs to improve on state-of-the-art autoregressive density estimation models; such highly performant MAF models could also be "distilled" for rapid sampling using the same procedure as in Oord et al. (2017).

Other recent works also find novel applications of NFs, demonstrating their broad utility. Loaiza-Ganem et al. (2017) use NFs to solve maximum entropy problems, rather than match a target distribution. Louizos & Welling (2017) and Krueger et al. (2017) apply NFs to express approximate posteriors over parameters of neural networks. Song et al. (2017) use NFs as a proposal distribution in a novel Metropolis-Hastings MCMC algorithm.

Finally, there are also several works which develop new techniques for constructing NFs that are orthogonal to ours (Tomczak & Welling, 2017; 2016; Gemici et al., 2016; Duvenaud et al., 2016; Berg et al., 2018).

Figure 6. Fitting a grid of Gaussian distributions using maximum likelihood. Left: true distribution. Center: affine autoregressive flow (AAF). Right: neural autoregressive flow (NAF).

6. Experiments

Our experiments evaluate NAFs on the classic applications of variational inference and density estimation, where we outperform IAF and MAF baselines. We first demonstrate the qualitative advantage NAFs have over AAFs in energy function fitting and density estimation (Section 6.1). We then demonstrate the capability of NAFs to capture a multimodal Bayesian posterior in a limited data setting (Section 6.2). For larger-scale experiments, we show that using NAF instead of IAF to approximate the posterior distribution of latent variables in a variational autoencoder (Kingma & Welling, 2013; Rezende et al., 2014) yields better likelihood results on binarized MNIST (Larochelle & Murray, 2011) (Section 6.3). Finally, we report our experimental results on density estimation of a suite of UCI datasets (Section 6.4).

6.1. Toy energy fitting and density estimation

6.1.1. Expressiveness

First, we demonstrate that, in the case of marginally independent distributions, an affine transformation can fail to fit the true distribution. We consider a mixture of Gaussians density estimation task. We define the modes of the Gaussians to be laid out on a 2D meshgrid within the range [-5, 5], and consider 2, 5 and 10 modes on each dimension.
Table 2. Test log-likelihood and error bars of 2 standard deviations on the 5 datasets (5 trials of experiments). Neural autoregressive flows (NAFs) produce state-of-the-art density estimation results on all 5 datasets. The numbers (5 or 10) in parentheses indicate the number of transformations which were stacked; for TAN (Oliva et al., 2018), we include their best results, achieved using different architectures on different datasets. We also include validation results to give future researchers a fair way of comparing their methods with ours during development.

Model                       | POWER       | GAS          | HEPMASS       | MINIBOONE     | BSDS300
MADE MoG                    | 0.40 ± 0.01 | 8.47 ± 0.02  | −15.15 ± 0.02 | −12.27 ± 0.47 | 153.71 ± 0.28
MAF-affine (5)              | 0.14 ± 0.01 | 9.07 ± 0.02  | −17.70 ± 0.02 | −11.75 ± 0.44 | 155.69 ± 0.28
MAF-affine (10)             | 0.24 ± 0.01 | 10.08 ± 0.02 | −17.73 ± 0.02 | −12.24 ± 0.45 | 154.93 ± 0.28
MAF-affine MoG (5)          | 0.30 ± 0.01 | 9.59 ± 0.02  | −17.39 ± 0.02 | −11.68 ± 0.44 | 156.36 ± 0.28
TAN (various architectures) | 0.48 ± 0.01 | 11.19 ± 0.02 | −15.12 ± 0.02 | −11.01 ± 0.48 | 157.03 ± 0.07
MAF-DDSF (5)                | 0.62 ± 0.01 | 11.91 ± 0.13 | −15.09 ± 0.40 | −8.86 ± 0.15  | 157.73 ± 0.04
MAF-DDSF (10)               | 0.60 ± 0.02 | 11.96 ± 0.33 | −15.32 ± 0.23 | −9.01 ± 0.01  | 157.43 ± 0.30
MAF-DDSF (5) valid          | 0.63 ± 0.01 | 11.91 ± 0.13 | −15.10 ± 0.42 | −8.38 ± 0.13  | 172.89 ± 0.04
MAF-DDSF (10) valid         | 0.60 ± 0.02 | 11.95 ± 0.33 | −15.34 ± 0.24 | −8.50 ± 0.03  | 172.58 ± 0.32
While the affine flow only produces a single mode, the neural flow matches the target distribution quite well even up to a 10x10 grid with 100 modes (see Figure 6).

6.1.2. Convergence

We then repeat the experiment that produces Figures 1 and 2 16 times, smooth out the learning curves, and present the average convergence result of each model with its corresponding standard deviation. For the affine flow, we stack 6 layers of transformation with reversed orderings. For DSF and DDSF we used one transformation. We set d = 16 for both, and L = 2 for DDSF.

6.2. Sine Wave experiment

Here we demonstrate the ability of DSF to capture multimodal posterior distributions. To do so, we create a toy experiment where the goal is to infer the posterior over the frequency of a sine wave, given only 3 datapoints. We fix the form of the function as y(t) = sin(2πf · t) and specify a uniform prior over the frequency: p(f) = U([0, 2]). The task is to infer the posterior distribution p(f | T, Y) given the dataset (T, Y) = ((0, 5/6, 10/6), (0, 0, 0)), as represented by the red crosses of Figure 8 (left). We assume the data likelihood given the frequency parameter to be p(y_i | t_i, f) = N(y_i; y_f(t_i), 0.125), where the variance σ² = 0.125 represents the inherent uncertainty of the data. Figure 8 (right) shows that DSF learns a good posterior in this task.

6.3. Amortized Approximate Posterior

We evaluate NAF's ability to improve variational inference, in the context of the binarized MNIST (Larochelle & Murray, 2011) benchmark, using the well-known variational autoencoder (Kingma & Welling, 2013; Rezende et al., 2014) (Table 1). Here again the DSF architecture outperforms both standard IAF and the traditional independent Gaussian posterior by a statistically significant margin.

6.4. Density Estimation with Masked Autoregressive Flows

We replicate the density estimation experiments of Papamakarios et al. (2017), which compare MADE (Germain et al., 2015) and RealNVP (Dinh et al., 2016) to their proposed MAF model (using either 5 or 10 layers of MAF) on BSDS300 (Martin et al., 2001) as well as 4 UCI datasets (Lichman, 2013) processed as in Uria et al. (2013). Simply replacing the affine transformer with our DDSF architecture in their best performing architecture for each task (keeping all other settings fixed) results in substantial performance gains, and also outperforms the more recent Transformation Autoregressive Networks (TAN) of Oliva et al. (2018), setting a new state-of-the-art for these tasks. Results are presented in Table 2.

7. Conclusion

In this work we introduce the neural autoregressive flow (NAF), a flexible method of tractably approximating rich families of distributions. In particular, our experiments show that NAF is able to model multimodal distributions and outperform related methods such as inverse autoregressive flow in density estimation and variational inference. Our work emphasizes the difficulty and importance of capturing multimodality, as previous methods fail even on simple toy tasks, whereas our method yields significant improvements in performance.
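The posterior targeted in Section 6.2 can be written down exactly on a grid, which is how one would check the multimodal shape the learned DSF posterior has to match. The NumPy sketch below is our own; the grid resolution and variable names are illustrative choices, while the prior, likelihood, and data follow the description above.

```python
# Exact (gridded) posterior p(f | T, Y) for the sine-wave experiment of Section 6.2.
import numpy as np

t_obs = np.array([0.0, 5.0 / 6.0, 10.0 / 6.0])   # observation times T
y_obs = np.array([0.0, 0.0, 0.0])                # observed values Y
var = 0.125                                      # likelihood variance sigma^2

f_grid = np.linspace(0.0, 2.0, 2001)             # support of the prior U([0, 2])
log_prior = np.log(np.full_like(f_grid, 0.5))    # density of U([0, 2])

# p(y_i | t_i, f) = N(y_i; sin(2*pi*f*t_i), 0.125), multiplied over the 3 datapoints.
mean = np.sin(2.0 * np.pi * f_grid[:, None] * t_obs[None, :])
log_lik = -0.5 * ((y_obs - mean) ** 2 / var + np.log(2.0 * np.pi * var)).sum(axis=1)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum() * (f_grid[1] - f_grid[0])     # normalize on the grid

idx = [int(np.argmin(np.abs(f_grid - v))) for v in (0.3, 0.6, 1.2, 1.8)]
print(post[idx])   # low density at f = 0.3; separated modes near f = 0.6, 1.2, 1.8
```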
Gemici, Mevlana C, Rezende, Danilo, and Mohamed, Shakir. Normalizing flows on Riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.

Loaiza-Ganem, Gabriel, Gao, Yuanjun, and Cunningham, John P. Maximum entropy flow networks. arXiv preprint arXiv:1701.03504, 2017.
Louizos, Christos and Welling, Max. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.

Martin, David, Fowlkes, Charless, Tal, Doron, and Malik, Jitendra. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2. IEEE, 2001.

Oliva, J. B., Dubey, A., Póczos, B., Schneider, J., and Xing, E. P. Transformation autoregressive networks. arXiv e-prints, January 2018.

Oord, Aaron van den, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Oord, Aaron van den, Li, Yazhe, Babuschkin, Igor, Simonyan, Karen, Vinyals, Oriol, Kavukcuoglu, Koray, Driessche, George van den, Lockhart, Edward, Cobo, Luis C, Stimberg, Florian, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.

Papamakarios, George, Murray, Iain, and Pavlakou, Theo. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2017.

Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992.

Reddi, Sashank J, Kale, Satyen, and Kumar, Sanjiv. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Rezende, Danilo Jimenez and Mohamed, Shakir. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Rippel, Oren and Adams, Ryan Prescott. High-dimensional probability estimation with deep density models. arXiv preprint arXiv:1302.5125, 2013.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 2016.

Song, Jiaming, Zhao, Shengjia, and Ermon, Stefano. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, 2017.

Tomczak, Jakub M and Welling, Max. Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630, 2016.

Tomczak, Jakub M and Welling, Max. Improving variational auto-encoders using convex combination linear inverse autoregressive flow. arXiv preprint arXiv:1706.02326, 2017.

Turner, Richard E and Sahani, Maneesh. Two problems with variational expectation maximisation for time-series models. Bayesian Time Series Models, 2011.

Uria, Benigno, Murray, Iain, and Larochelle, Hugo. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, 2013.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv e-prints, September 2016.

Xu, Bing, Wang, Naiyan, Chen, Tianqi, and Li, Mu. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
A. Exclusive KL View of the MLE

Let us assume a change-of-variable model p_Z(z) on the random variable Z ∈ R^m, such as the one used in Dinh et al. (2016): z_0 ∼ p_0(z_0) and z = ψ(z_0), where ψ is an invertible function and density evaluation of z_0 ∈ R^m is tractable under p_0. The resulting density function can be written

    p_Z(z) = p_0(z_0) \left| \frac{\partial ψ(z_0)}{\partial z_0} \right|^{-1}

The maximum likelihood principle requires us to minimize the following metric:

    D_KL( p_data(z) || p_Z(z) )
        = E_{p_data}[ \log p_data(z) − \log p_Z(z) ]
        = E_{p_data}\left[ \log p_data(z) − \log \left( p_0(z_0) \left| \frac{\partial ψ^{-1}(z)}{\partial z} \right| \right) \right]
        = E_{p_data}\left[ \log p_data(z) \left| \frac{\partial ψ^{-1}(z)}{\partial z} \right|^{-1} − \log p_0(z_0) \right]

which coincides with the exclusive KL divergence; to see this, we take X = Z, Y = Z_0, f = ψ^{-1}, p_X = p_data, and p_target = p_0:

        = E_{p_X}\left[ \log p_X(x) \left| \frac{\partial f(x)}{\partial x} \right|^{-1} − \log p_target(y) \right]
        = D_KL( p_Y(y) || p_target(y) )

This means we want to transform the empirical distribution p_data, or p_X, to fit the target density (the usually unstructured base distribution p_0), as explained in Section 2.

... for some monotonic activation function A_l, positive weight vector w_{l,j}, and bias b_{l,j}. Differentiating this yields

    dh_{l,j} / dh_{l−1,k} = ( dA_l(p_{l,j}) / dp_{l,j} ) · w_{l,j,k},        (17)

which is greater than 0 since A is monotonic (and hence has positive derivative), and w_{l,j,k} is positive by construction. Thus any unit of layer l is monotonic with respect to any unit of layer l − 1, for 1 ≤ l ≤ L.

Now, suppose we have J monotonic functions f_j of the input x. Then the weighted sum of these functions Σ_{j=1}^{J} u_j f_j with u_j > 0 is also monotonic with respect to x, since

    \frac{d}{dx} Σ_{j=1}^{J} u_j f_j = Σ_{j=1}^{J} u_j \frac{df_j}{dx} > 0        (18)

Finally, we use induction to show that all h_{l,j} (including y) are monotonic with respect to x.

1. The base case (l = 1) is given by Equation 17.
2. Suppose the inductive hypothesis holds, which means h_{l,j} is monotonic with respect to x for all j of layer l. Then by Equation 18, h_{l+1,k} is also monotonic with respect to x for all k.

Thus by mathematical induction, monotonicity of h_{l,j} holds for all l and j.
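The induction argument above can be checked numerically: an MLP whose weights are constrained positive and whose activations are monotonic is monotonic in its input. The following NumPy sketch is ours; the layer sizes, weight ranges, and activation are arbitrary illustrative choices.

```python
# Numeric sanity check: positive weights + monotonic activations => monotonic network.
import numpy as np

rng = np.random.default_rng(0)
sizes = [1, 6, 6, 1]
weights = [rng.uniform(0.1, 0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]  # > 0
biases = [rng.normal(size=m) for m in sizes[1:]]

def mlp(x):
    h = np.array([x])
    for W, b in zip(weights, biases):
        h = np.tanh(W @ h + b)        # tanh is strictly increasing
    return h[0]

xs = np.linspace(-3.0, 3.0, 301)
ys = np.array([mlp(x) for x in xs])
print(np.all(np.diff(ys) > 0))        # True: the network is increasing in its input
```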
... For each layer of DDSF, we have

    C^{(l+1)} = a^{(l+1)} ⊙ ( u^{(l+1)} · h^{(l)} ) + b^{(l+1)}
    D^{(l+1)} = w^{(l+1)} · σ( C^{(l+1)} )
    h^{(l+1)} = σ^{-1}( D^{(l+1)} )

with dimensions C^{(l+1)}, a^{(l+1)}, b^{(l+1)}, D^{(l+1)}, h^{(l+1)} ∈ R^{d_{l+1}}, u^{(l+1)} ∈ R^{d_{l+1} × d_l}, h^{(l)} ∈ R^{d_l}, and w^{(l+1)} ∈ R^{d_{l+1} × d_{l+1}}.

... apply softplus to ensure positivity. In Equation 20, we have

    log w = logsoftmax(w)
    log u = logsoftmax(u)
    log σ(C) = logsigmoid(C)
    log (1 − σ(C)) = logsigmoid(−C)

Figure 9. Factorized weight of DDSF. The v_{i,j} are learned parameters of the model; only the pseudo-parameters η and a are output by the conditioner. The activation function f is softmax, so adding η yields an element-wise rescaling of the inputs to this layer of the transformer by exp(η).

[Figure: superposition-of-sigmoids illustration; y versus x, with curves S, S_n* and S_n.]

Figure 10. Visualization of how sigmoidal functions can universally approximate a monotonic function in [0, 1]. The red dotted curve is the target monotonic function (S), and the blue solid curve is the intermediate superposition of step functions (S_n*) with the parameters chosen in the proof. In this case, n is 6 and |S_n* − S| ≤ 1/7. The green dashed curve is the resulting superposition of sigmoids (S_n), which intercepts each step of S_n* with the additionally chosen τ.
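The DDSF layer reconstructed above can be sketched in a few lines. The NumPy code below is our own reading of those equations, with the positivity constraints applied as the surrounding text prescribes (row-wise softmax for u and w, softplus for a); the element-wise product, shapes, and names are assumptions for illustration.

```python
# One DDSF layer:  C = a * (u.h) + b,  D = w.sigmoid(C),  h_next = sigmoid^{-1}(D).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(m):
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softplus(z):
    return np.log1p(np.exp(z))

def ddsf_layer(h, u_pre, w_pre, a_pre, b):
    """Map h in R^{d_l} to R^{d_{l+1}}; increasing in every coordinate of h."""
    u = softmax_rows(u_pre)          # (d_{l+1}, d_l), positive rows summing to 1
    w = softmax_rows(w_pre)          # (d_{l+1}, d_{l+1})
    a = softplus(a_pre)              # (d_{l+1},), strictly positive slopes
    c = a * (u @ h) + b
    d_ = w @ sigmoid(c)              # convex combinations keep values in (0, 1)
    return np.log(d_) - np.log1p(-d_)   # sigmoid^{-1}

rng = np.random.default_rng(0)
d_in, d_out = 4, 4
h = rng.normal(size=d_in)
out = ddsf_layer(h,
                 rng.normal(size=(d_out, d_in)),
                 rng.normal(size=(d_out, d_out)),
                 rng.normal(size=d_out),
                 rng.normal(size=d_out))
print(out.shape)
```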
... S_n*(x_j) = y_j for 1 ≤ j ≤ n − 1 and S_n*(x_n) = 1. To do so, we set the bias terms b_j to be x_j. Then we just need to solve a system of n linear equations Σ_{j'=1}^{n} w_{j'} · s_{j'}(x_j) = t_j, where t_j = y_j for 1 ≤ j < n, t_0 = 0 and t_n = 1. We can express this system of equations in the matrix form Sw = t, where

    S_{j,j'} = s_{j'}(x_j) = δ_{x_j ≥ b_{j'}} = δ_{j ≥ j'},    [w]_j = w_j,    [t]_j = t_j

Let ε_1 = ε/3 and ε_2 = 2ε/3. We know that for this ε_1, there exists an n such that |S_n* − S| < ε_1. We chose the same w_j, b_j for j = 1, ..., n as the ones used in the proof of Lemma 1, and let τ_1, ..., τ_n all be the same value, denoted by τ. Take κ = min_{j ≠ j'} |b_j − b_{j'}| and τ = κ / σ^{-1}(1 − ε_0) for some ε_0 > 0. Take Γ to be a lower triangular matrix with values of 0.5 on the diagonal and 1 below the diagonal.
... function¹¹ S(x_{1:m})_t = S_t(x_t, x_{1:t−1}) for t ∈ [1, m] that is strictly monotonic with respect to the first argument when the second argument is fixed, where the boundary values are S_t(r_0, x_{1:t−1}) = 0 and S_t(r_1, x_{1:t−1}) = 1 for all x_{1:t−1} and t, then there exists a multivariate function Ŝ such that ‖Ŝ(x_{1:m}) − S(x_{1:m})‖_∞ < ε for all x_{1:m}, of the following form:

    Ŝ(x_{1:m})_t = Ŝ_t(x_t, C_t(x_{1:t−1}))
                 = Σ_{j=1}^{n} w_{tj}(x_{1:t−1}) · σ( (x_t − b_{tj}(x_{1:t−1})) / τ_{tj}(x_{1:t−1}) )

where t ∈ [1, m], and C_t = (w_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} are functions of x_{1:t−1} parameterized by neural networks, with τ_{tj} bounded and positive, b_{tj} ∈ [r_0, r_1], Σ_{j=1}^{n} w_{tj} = 1, and w_{tj} > 0.

Proof. (Lemma 3)

First we deal with the univariate case (for any t) and drop the subscript t of the functions. We write Ŝ_n and C_k to denote the sequences of univariate functions. We want to show that for any ε > 0, there exist (1) a sequence of functions Ŝ_n(x_t, C_k(x_{1:t−1})) in the given form, and (2) large enough N and K such that when n ≥ N and k ≥ K, |Ŝ_n(x_t, C_k(x_{1:t−1})) − S(x_t, x_{1:t−1})| ≤ ε for all x_{1:t} ∈ [r_0, r_1]^t.

The idea is first to show that we can find a sequence of parameters, C_n(x_{1:t−1}), that yield a good approximation of the target function, S(x_t, x_{1:t−1}). We then show that these parameters can be arbitrarily well approximated by the outputs of a neural network, C_k(x_{1:t−1}), which in turn yield a good approximation of S.

From Lemma 2, we know that such a sequence C_n(x_{1:t−1}) exists, and furthermore that we can, for any ε, and independently of S and x_{1:t−1}, choose an N large enough so that:

    | Ŝ_n(x_t, C_n(x_{1:t−1})) − S(x_t, x_{1:t−1}) | < ε/2        (22)

To see that we can further approximate a given C_n(x_{1:t−1}) well by C_k(x_{1:t−1})¹², we apply the classic result of Cybenko (1989), which states that a multilayer perceptron can approximate any continuous function on a compact subset of R^m. Note that the specification of C_n(x_{1:t−1}) = (w_{tj}, b_{tj}, τ_{tj})_{j=1}^{n} in Lemma 2 depends on the quantiles of S_t(x_t, ·) as a function of x_{1:t−1}; since the quantiles are continuous functions of x_{1:t−1}, so is C_n(x_{1:t−1}), and the theorem applies.

Now, Ŝ has bounded derivative with respect to C, and is thus uniformly continuous, so long as τ is greater than some positive constant, which is always the case for any fixed C_n, and thus can be guaranteed for C_k as well (for large enough k). Uniform continuity allows us to translate the convergence of C_k → C_n to convergence of Ŝ_n(x_t, C_k(x_{1:t−1})) → Ŝ_n(x_t, C_n(x_{1:t−1})), since for any ε, there exists a δ > 0 such that

    ‖ C_k(x_{1:t−1}) − C_n(x_{1:t−1}) ‖_∞ < δ
    ⟹ | Ŝ_n(x_t, C_k(x_{1:t−1})) − Ŝ_n(x_t, C_n(x_{1:t−1})) | < ε/2        (23)

Combining this with Equation 22, we have, for all x_t and x_{1:t−1}, and for all n ≥ N and k ≥ K,

    | Ŝ_n(x_t, C_k(x_{1:t−1})) − S(x_t, x_{1:t−1}) |
        ≤ | Ŝ_n(x_t, C_k(x_{1:t−1})) − Ŝ_n(x_t, C_n(x_{1:t−1})) | + | Ŝ_n(x_t, C_n(x_{1:t−1})) − S(x_t, x_{1:t−1}) |
        < ε/2 + ε/2 = ε

Having proved the univariate case, we add back the subscript t to denote the index of the function. From the above, we know that given any ε > 0, for each t there exist N(t) and a sequence of univariate functions Ŝ_{n,t} such that for all n ≥ N(t), |Ŝ_{n,t} − S_t| < ε for all x_{1:t}. Choosing N = max_{t∈[1,m]} N(t), we have that there exists a sequence of multivariate functions Ŝ_n in the given form such that for all n ≥ N, ‖Ŝ_n − S‖_∞ < ε for all x_{1:m}.

G. Proof of Universal Approximation of DSF

Lemma 4. Let X ∈ 𝒳 be a random variable, where 𝒳 ⊆ R^m and 𝒴 ⊆ R^m. Given any function J : 𝒳 → 𝒴 and a sequence of functions J_n that converges pointwise to J, the sequence of random variables induced by the transformations Y_n := J_n(X) converges in distribution to Y := J(X).

Proof. (Lemma 4)

Let h be any bounded, continuous function on R^m, so that h ∘ J_n converges pointwise to h ∘ J by continuity of h. Since h is bounded, then by the dominated convergence theorem, E[h(Y_n)] = E[h(J_n(X))] converges to E[h(J(X))] = E[h(Y)]. As this result holds for any bounded continuous function h, by the Portmanteau lemma, we have Y_n →d Y.

¹¹ S(·) : [r_0, r_1]^m → [0, 1]^m is a multivariate-multivariable function, where S(·)_t is its t-th component, which is a univariate-multivariable function, written as S_t(·, ·) : [r_0, r_1] × [r_0, r_1]^{t−1} → [0, 1].

¹² Note that C_n is a chosen function (we are not assuming its parameterization; i.e. not necessarily a neural network) that we seek to approximate using C_k, which is the output of a neural network.