This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how an adversarial training process lets them generate images without explicitly defining a probability density. The second half relates GANs to actor-critic methods and to inverse reinforcement learning, showing how training a generator to fool a discriminator parallels how policies are trained in reinforcement learning.
11. Image generation from a VAE
¤ Not only 𝑝(𝑥|𝑧) but also 𝑝(𝑧) ( = ∫ 𝑞(𝑧|𝑥)𝑝(𝑥)𝑑𝑥) is learned.
¤ The data manifold is captured in 𝑧.
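The caption below describes the manifold plots as decoding linearly spaced unit-square coordinates pushed through the inverse Gaussian CDF. Here is a minimal sketch of that grid construction, assuming a hypothetical trained decoder for p(x|z); it is not code from [Kingma+ 13].

```python
# Build a grid of 2-D latent codes by mapping a linearly spaced unit-square grid
# through the inverse CDF of the standard Gaussian, then decode each grid point.
import numpy as np
from scipy.stats import norm

def latent_grid(n=20, eps=1e-3):
    """Grid of 2-D latent codes covering the Gaussian prior."""
    u = np.linspace(eps, 1 - eps, n)                 # linearly spaced in the unit square
    z1, z2 = np.meshgrid(norm.ppf(u), norm.ppf(u))   # inverse CDF -> latent values
    return np.stack([z1.ravel(), z2.ravel()], axis=1)  # shape (n*n, 2)

# `decoder` is a hypothetical function mapping a (2,) latent vector to an image:
# images = np.array([decoder(z) for z in latent_grid()])
```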
(a) Learned Frey Face manifold (b) Learned MNIST manifold
Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z. For each of these values z, we plotted the corresponding generative pθ(x|z) with the learned parameters θ.
From [Kingma+ 13]
12. Generated images
¤ Images are sampled from random 𝑧.
¤ Outlines and edges tend to be blurry.
(b) 5-D latent space (c) 10-D latent space (d) 20-D latent space
Figure 5: Samples from learned generative models of MNIST for different dimensionalities of latent space.
From [Kingma+ 13], @AlecRad
22. Generative adversarial nets
¤ Overall architecture
¤ Intuitively, 𝐺 and 𝐷 play the following (minimax) game (adversarial training).
¤ 𝐺 generates 𝑥 so as to fool 𝐷 as much as possible.
¤ 𝐷 discriminates so as not to be fooled by 𝐺.
-> In the end, 𝐺 generates 𝑥 that 𝐷 cannot distinguish from real data.
min_G max_D 𝑉(𝐷, 𝐺)
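A minimal sketch of this adversarial game, assuming PyTorch, small MLPs for 𝐺 and 𝐷, and a toy 2-D data distribution; it uses the common non-saturating generator loss rather than the literal minimax objective, and is not the implementation from [Goodfellow+ 14].

```python
import torch
import torch.nn as nn

z_dim, x_dim = 8, 2
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # outputs a logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def sample_real(n):            # stand-in for the data distribution p_data(x)
    return torch.randn(n, x_dim) + 3.0

for step in range(1000):
    # D step: maximize log D(x) + log(1 - D(G(z)))
    x_real = sample_real(64)
    x_fake = G(torch.randn(64, z_dim)).detach()
    loss_d = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # G step: in practice maximize log D(G(z)) (the non-saturating variant)
    x_fake = G(torch.randn(64, z_dim))
    loss_g = bce(D(x_fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```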
25. KL divergence and JS divergence
¤ Maximum likelihood (KL divergence minimization) also covers regions with no data (B).
¤ Reversing the KL makes the model fit only a single mode (C).
¤ The JS divergence learns something in between (D).
[Huszár+ 15]
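A small numerical illustration of these behaviours (my own toy example, not from [Huszár+ 15]): P is bimodal, and Q either spreads over both modes or sits on a single one.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P         = np.array([0.48, 0.02, 0.02, 0.48])  # two well-separated modes
Q_spread  = np.array([0.25, 0.25, 0.25, 0.25])  # covers both modes (and the gap)
Q_onemode = np.array([0.94, 0.02, 0.02, 0.02])  # sits on a single mode

for name, Q in [("spread", Q_spread), ("one mode", Q_onemode)]:
    print(f"{name:9s} KL(P||Q)={kl(P, Q):.3f}  KL(Q||P)={kl(Q, P):.3f}  JS={js(P, Q):.3f}")
# Forward KL(P||Q) prefers the spread-out Q (it must cover every mode of P),
# reverse KL(Q||P) prefers the single-mode Q, and JS sits in between.
```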
A: P   B: arg min_Q JS_0.1[P‖Q]   C: arg min_Q JS_0.5[P‖Q]   D: arg min_Q JS_0.99[P‖Q]
Figure 1: Illustrating the behaviour of the generalised JS divergence under model underspecification for a range of values of π. Data is drawn from a multivariate Gaussian distribution P (A), and we aim to approximate it by a single isotropic Gaussian (B-D). Contours show level sets of the approximating distribution, overlaid on top of the 2D histogram of observed data. For π = 0.1, JS divergence minimisation behaves like maximum likelihood (B), resulting in the characteristic moment-matching behaviour. For π = 0.99 (D), the behaviour becomes more akin to the mode-seeking behaviour of minimising KL[Q‖P]. For the intermediate value of π = 0.5 (C) we recover the standard JS divergence approximated by adversarial training. To produce this illustration we used software made available by Theis et al. (2015).
It is easy to show that the JS_π divergence converges to 0 in the limit of both π → 0 and π → 1. Crucially, it can be shown that the gradients with respect to π at these two extremes recover KL[Q‖P] and KL[P‖Q], respectively. A proof of this property can be obtained by considering the Taylor expansion KL[Q‖Q + a] ≈ aᵀHa, where H is the positive definite Hessian, and substituting a = π(P − Q), as follows:

lim_{π→0} JSD[P‖Q; π]/π = lim_{π→0} { KL[P‖πP + (1 − π)Q] + ((1 − π)/π) KL[Q‖πP + (1 − π)Q] }   (14)
26. Images generated by a GAN
¤ Images are sampled from random 𝑧 [Goodfellow+ 14].
Model             MNIST      TFD
Adversarial nets  225 ± 2    2057 ± 26
Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold. On TFD, σ was cross-validated on each fold and the mean log-likelihood on each fold was computed. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.
The σ parameter of the Gaussians was obtained by cross-validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces, but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.
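A minimal sketch of the Gaussian Parzen-window estimate described above (my own version, not the paper's evaluation code): fit an isotropic Gaussian kernel of bandwidth σ to generated samples and score test points.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, test, sigma):
    """Mean log-likelihood of `test` under a Parzen window fit to `samples`."""
    n, d = samples.shape
    # Squared distances between every test point and every generated sample.
    d2 = ((test[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    log_kernel = -d2 / (2 * sigma ** 2) - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return np.mean(logsumexp(log_kernel, axis=1) - np.log(n))

# sigma would be chosen by cross-validation on a held-out validation split,
# as in Breuleux et al. [8].
```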
In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no
claim that these samples are better than samples generated by existing methods, we believe that these
samples are at least competitive with the better generative models in the literature and highlight the
potential of the adversarial framework.
Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of
the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples
are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these
images show actual samples from the model distributions, not conditional means given samples of hidden units.
Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain
mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator
and “deconvolutional” generator)
32. Images generated by DCGAN
¤ Compared with earlier models, far cleaner images can now be generated.
¤ DCGAN was a breakthrough work for GANs.
Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model
could learn to memorize training examples, but this is experimentally unlikely as we train with a
small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating
memorization with SGD and a small learning rate.
Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual
under-fitting via repeated noise textures across multiple samples such as the base boards of some of
the beds.
34. The mode collapse problem
¤ The original formulation fixes 𝐺 and optimizes 𝐷.
¤ 𝐺 is then trained against the optimized 𝐷 (min-max).
¤ What happens if we instead fix an (insufficiently trained) 𝐷 and optimize 𝐺?
-> 𝐺 learns to map all of its generated data onto a single mode (peak) of the data distribution (mode collapse).
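The unrolled-GAN experiment quoted below uses a toy 2-D mixture of Gaussians. This is a minimal sketch of such a dataset (a ring of 8 modes), useful for reproducing mode-collapse behaviour; the exact configuration in [Metz+ 17] may differ, so the number of modes, radius, and noise scale here are assumptions.

```python
import numpy as np

def ring_of_gaussians(n_samples, n_modes=8, radius=2.0, std=0.05, seed=0):
    """Sample points from a ring of n_modes isotropic Gaussians."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * rng.integers(0, n_modes, size=n_samples) / n_modes
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return centers + std * rng.standard_normal((n_samples, 2))

data = ring_of_gaussians(10_000)  # a collapsed G would cover only one of the 8 modes
```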
Figure 2: Unrolling the discriminator stabilizes GAN training on a toy 2D mixture of Gaussians
dataset. Columns show a heatmap of the generator distribution after increasing numbers of training
steps. The final column shows the data distribution. The top row shows training for a GAN with
10 unrolling steps. Its generator quickly spreads out and converges to the target distribution. The
bottom row shows standard GAN training. The generator rotates through the modes of the data
distribution. It never converges to a fixed distribution, and only ever assigns significant probability
mass to a single data mode at once.
Figure 3: Unrolled GAN training increases stability for an RNN generator and convolutional dis-
criminator trained on MNIST. The top row was run with 20 unrolling steps. The bottom row is a
standard GAN, with 0 unrolling steps. Images are samples from the generator after the indicated
number of training steps.
3.2 PATHOLOGICAL MODEL WITH MISMATCHED GENERATOR AND DISCRIMINATOR
To evaluate the ability of this approach to improve trainability, we look to a traditionally challenging family of models to train – recurrent neural networks (RNNs). In this experiment we try to generate MNIST samples using an LSTM (Hochreiter & Schmidhuber, 1997). MNIST digits are 28x28 pixel images. At each timestep of the generator LSTM, it outputs one column of this image, so that after 28 timesteps it has output the entire sample. We use a convolutional neural network as the discriminator.
[Metz+ 17]
43. Images generated with cGANs
¤ An example conditioned on text [Reed+ 16].
¤ An example conditioned on an image [Isola+ 16] (pix2pix, introduced again later).
Generative Adversarial Text to Image Synthesis
Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, Honglak Lee
University of Michigan, Ann Arbor, MI, USA (UMICH.EDU); Max Planck Institute for Informatics, Saarbrücken, Germany (MPI-INF.MPG.DE)
this small bird has a pink
breast and crown, and black
primaries and secondaries.
the flower has petals that
are bright pinkish purple
with white stigma
this magnificent fellow is
almost all black with a red
crest, and white cheek patch.
this white and yellow flower
have thin white petals and a
round yellow stamen
Figure 1. Examples of generated images from text descriptions.
Left: captions are from zero-shot (held out) categories, unseen
text. Right: captions are from the training set.
55. Fréchet Inception Distance
¤ An evaluation metric different from IS [Heusel+ 17].
¤ Samples from 𝑝_data(𝑥) and 𝑝_g(𝑥) are mapped into a chosen layer of the Inception model (the pool3 layer).
¤ The embedded samples are treated as a continuous multivariate Gaussian, and their mean and covariance are computed.
¤ The Fréchet distance (Wasserstein-2 distance) between the two Gaussians is then computed (the Fréchet inception distance, FID).
¤ Compared with IS (the Inception Score, called the inception distance here), FID is a more appropriate evaluation metric.
FID(data, g) = ||𝜇_data − 𝜇_g||₂² + Tr(Σ_data + Σ_g − 2(Σ_data Σ_g)^(1/2))
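A minimal sketch of the FID formula above (not the reference implementation from [Heusel+ 17]); `feats_data` and `feats_gen` are assumed to be Inception pool3 feature matrices of shape (n_samples, 2048) for real and generated images.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_data, feats_gen, eps=1e-6):
    mu1, mu2 = feats_data.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_data, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; discard tiny imaginary parts
    # that appear for numerical reasons.
    covmean = sqrtm(cov1 @ cov2 + eps * np.eye(cov1.shape[0]))
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))
```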
Figure A8: Left: FID and right: Inception Score are evaluated for first row: Gaussian noise, second
row: Gaussian blur, third row: implanted black rectangles, fourth row: swirled images, fifth row:
salt and pepper noise, and sixth row: the CelebA dataset contaminated by ImageNet images. Left is
the smallest disturbance level of zero, which increases to the highest level at right. The FID captures
the disturbance level very well by monotonically increasing whereas the Inception Score fluctuates,
stays flat or even, in the worst case, decreases.
59. Image-to-image translation
¤ Generate a corresponding output image from an input image.
¤ Pix2pix [Isola+ 16]
¤ The conditional GAN is conditioned on the source image.
¤ 𝐺 is made an autoencoder.
¤ BicycleGAN [Zhu+ 17]
¤ Instead of learning a deterministic mapping, the model accounts for the variance of the latent variable.
Positive examples are real pairs; negative examples are synthesized pairs. G tries to synthesize fake images that fool D, and D tries to identify the fakes.
Figure 2: Training a conditional GAN to predict aerial photos from
maps. The discriminator, D, learns to classify between real and
synthesized pairs. The generator learns to fool the discriminator.
Unlike an unconditional GAN, both the generator and discrimina-
tor observe an input image.
where G tries to minimize this objective against an adversarial D that tries to maximize it, i.e. G* = arg min_G max_D L_cGAN(G, D).
To test the importance of conditioning the discriminator,
we also compare to an unconditional variant in which the
discriminator does not observe x:
L_GAN(G, D) = E_{y∼p_data(y)}[log D(y)] + E_{x∼p_data(x), z∼p_z(z)}[log(1 − D(G(x, z)))].   (2)
Previous approaches to conditional GANs have found it
beneficial to mix the GAN objective with a more traditional
loss, such as L2 distance [29]. The discriminator’s job re-
mains unchanged, but the generator is tasked to not only
fool the discriminator but also to be near the ground truth
output in an L2 sense. We also explore this option, using
L1 distance rather than L2 as L1 encourages less blurring:
L_L1(G) = E_{x,y,z}[||y − G(x, z)||_1].   (3)
Our final objective is G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G).   (4)
Figure 3: the "U-Net" generator architecture.
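A minimal sketch of this pix2pix-style objective, assuming PyTorch; `G` is taken to map (x, z) to an output image and `D` to map an (input, output) pair to a logit, both hypothetical modules, and λ is a hyperparameter (100 is a commonly used value).

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lam = 100.0  # weight on the L1 term

def generator_loss(G, D, x, y, z):
    fake = G(x, z)
    logit = D(x, fake)
    adv = bce(logit, torch.ones_like(logit))  # fool the conditional discriminator
    return adv + lam * l1(fake, y)            # plus the L1 term toward the ground truth y

def discriminator_loss(G, D, x, y, z):
    fake = G(x, z).detach()
    real_logit, fake_logit = D(x, y), D(x, fake)
    return bce(real_logit, torch.ones_like(real_logit)) + \
           bce(fake_logit, torch.zeros_like(fake_logit))
```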
Image-to-Image Translation with Conditional Adversarial Networks
Phillip Isola Jun-Yan Zhu Tinghui Zhou Alexei A. Efros
Berkeley AI Research (BAIR) Laboratory
University of California, Berkeley
{isola,junyanz,tinghuiz,efros}@eecs.berkeley.edu
Figure 1 panels (input → output): Labels to Facade, BW to Color, Aerial to Map, Labels to Street Scene, Edges to Photo, Day to Night.
Figure 1: Many problems in image processing, graphics, and vision involve translating an input image into a corresponding output image.
These problems are often treated with application-specific algorithms, even though the setting is always the same: map pixels to pixels.
Conditional adversarial nets are a general-purpose solution that appears to work well on a wide variety of these problems. Here we show
results of the method on several. In each case we use the same architecture and objective, and simply train on different data.
Abstract
We investigate conditional adversarial networks as a
general-purpose solution to image-to-image translation
problems. These networks not only learn the mapping from
input image to output image, but also learn a loss func-
tion to train this mapping. This makes it possible to apply
the same generic approach to problems that traditionally
would require very different loss formulations. We demon-
strate that this approach is effective at synthesizing photos
from label maps, reconstructing objects from edge maps,
and colorizing images, among other tasks. As a commu-
nity, we no longer hand-engineer our mapping functions,
and this work suggests we can achieve reasonable results
without hand-engineering our loss functions either.
Many problems in image processing, computer graphics,
and computer vision can be posed as “translating” an input
image into a corresponding output image. Just as a concept
may be expressed in either English or French, a scene may
be rendered as an RGB image, a gradient field, an edge map,
a semantic label map, etc. In analogy to automatic language
translation, we define automatic image-to-image translation
as the problem of translating one possible representation of
a scene into another, given sufficient training data (see Fig-
ure 1). One reason language translation is difficult is be-
cause the mapping between languages is rarely one-to-one
– any given concept is easier to express in one language
than another. Similarly, most image-to-image translation
problems are either many-to-one (computer vision) – map-
ping photographs to edges, segments, or semantic labels,
or one-to-many (computer graphics) – mapping labels or
sparse user inputs to realistic images. Traditionally, each of
these tasks has been tackled with separate, special-purpose
machinery (e.g., [7, 15, 11, 1, 3, 37, 21, 26, 9, 42, 46]),
despite the fact that the setting is always the same: predict
pixels from pixels. Our goal in this paper is to develop a
common framework for all these problems.
60. Image-to-image translation
¤ Bidirectional translation with unpaired data
¤ CycleGAN [Zhu+ 17]
¤ StarGAN [Choi+ 17]
¤ Translation among multiple domains, not just between two.
¤ Only a single 𝐺 and a single 𝐷 are used.
For example, images of women represent one domain while those of men represent another.
Several image datasets come with a number of labeled
attributes. For instance, the CelebA[17] dataset contains 40
labels related to facial attributes such as hair color, gender,
and age, and the RaFD [11] dataset has 8 labels for facial
expressions such as ‘happy’, ‘angry’ and ‘sad’. These set-
tings enable us to perform more interesting tasks, namely
multi-domain image-to-image translation, where we change
images according to attributes from multiple domains. The
first five columns in Fig. 1 show how a CelebA image can
be translated according to any of the four domains, ‘blond
hair’, ‘gender’, ‘aged’, and ‘pale skin’. We can further ex-
tend to training multiple domains from different datasets,
such as jointly training CelebA and RaFD images to change
a CelebA image’s facial expression using features learned
by training on RaFD, as in the rightmost columns of Fig. 1.
However, existing models are both inefficient and ineffective in such multi-domain image translation tasks.
(a) Cross-domain models   (b) StarGAN
Figure 2. Comparison between cross-domain models and our pro-
posed model, StarGAN. (a) To handle multiple domains, cross-
domain models should be built for every pair of image domains.
(b) StarGAN is capable of learning mappings among multiple do-
mains using a single generator. The figure represents a star topol-
ogy connecting multi-domains.
62. Actor-critic
¤ A method that learns a policy (actor) and a value function (critic) simultaneously.
¤ Under an MDP (notation omitted)...
¤ The action-value function is:
¤ Policy update:
¤ Action-value function update:
¤ This yields a two-level (bilevel) optimization algorithm.
¤ Doesn't this look a lot like a GAN??
Most reinforcement learning algorithms either focus on learning a value function, like value iteration and TD-learning, or on learning a policy directly, as in policy gradient methods; AC methods learn both simultaneously - the actor being the policy and the critic being the value function. In some AC methods, the critic provides a lower-variance baseline for policy gradient methods than estimating the value from returns. In this case even a bad estimate of the value function can be useful, as the policy gradient will be unbiased no matter what baseline is used. In other AC methods, the policy is updated with respect to the approximate value function, in which case pathologies similar to those in GANs can result. If the policy is optimized with respect to an incorrect value function, it may lead to a bad policy which never fully explores the space, preventing a good value function from being found and leading to degenerate solutions. A number of techniques exist to remedy this problem.

Formally, consider the typical MDP setting for RL, where we have a set of states S, actions A, a distribution over initial states p0(s), transition function P(s_{t+1}|s_t, a_t), reward distribution R(s_t) and discount factor γ ∈ [0, 1]. The aim of actor-critic methods is to simultaneously learn an action-value function Qπ(s, a) that predicts the expected discounted reward:

Qπ(s, a) = E_{s_{t+k}∼P, r_{t+k}∼R, a_{t+k}∼π} [ Σ_{k=1}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ]   (6)

and learn a policy that is optimal for that value function:

π* = arg max_π E_{s0∼p0, a0∼π}[Qπ(s0, a0)]   (7)

We can express Qπ as the solution to a minimization problem:

Qπ = arg min_Q E_{s_t, a_t∼π}[ D( E_{s_{t+1}, r_t, a_{t+1}}[r_t + γQ(s_{t+1}, a_{t+1})] || Q(s_t, a_t) ) ]   (8)

where D(·||·) is any divergence that is positive except when the two are equal. Now the actor-critic problem can be expressed as a bilevel optimization problem as well:

F(Q, π) = E_{s_t, a_t∼π}[ D( E_{s_{t+1}, r_t, a_{t+1}}[r_t + γQ(s_{t+1}, a_{t+1})] || Q(s_t, a_t) ) ]   (9)
f(Q, π) = −E_{s0∼p0, a0∼π}[Qπ(s0, a0)]   (10)

There are many AC methods that attempt to solve this problem. Traditional AC methods optimize the policy through policy gradients and scale the policy gradient by the TD error, while the action-value function is updated by ordinary TD learning. We focus on deterministic policy gradients (DPG) [7, 10] and its extension to stochastic policies, SVG(0) [8], as well as neurally-fitted Q-learning with continuous actions (NFQCA) [9]. These algorithms are all intended for the case where actions and observations are continuous, and use neural networks for function approximation for both the action-value function and policy. This is an established approach in RL with continuous actions [11], and all methods update the policy by passing back gradients of the estimated value with respect to the actions rather than passing the TD error directly. The distinction between the methods lies mainly in the way training proceeds.
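A minimal DPG/NFQCA-style sketch (PyTorch, my own toy version rather than any of the cited implementations) of the bilevel structure in Eqs. (8)-(10): the critic Q is fit to a TD target, and the actor is updated by passing the gradient of Q back through the chosen action.

```python
import torch
import torch.nn as nn

s_dim, a_dim, gamma = 4, 2, 0.99
actor  = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, a_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(s_dim + a_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(s, a):
    return critic(torch.cat([s, a], dim=-1))

def update(s, a, r, s_next):
    # Critic: minimize D(r + gamma * Q(s', pi(s')) || Q(s, a)) with D = squared error.
    with torch.no_grad():
        target = r + gamma * q(s_next, actor(s_next))
    critic_loss = ((q(s, a) - target) ** 2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor: maximize E[Q(s, pi(s))], i.e. minimize f(Q, pi) = -E[Q(s, pi(s))].
    actor_loss = -q(s, actor(s)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```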
63. Actor-critic and GANs
¤ Actor-critic and GANs can be laid out in the same diagram [Pfau+ 17].
¤ 𝐺 can be regarded as the policy and 𝐷 as the value function.
¤ However, in a GAN, 𝐺 does not receive 𝑥 (the current state).
¤ It only receives random 𝑧.
(a) Generative Adversarial Networks [4]   (b) Deterministic Policy Gradient [7] / SVG(0) [8] / Neurally-Fitted Q-learning with Continuous Actions [9]
Figure 1: Information structure of GANs and AC methods. Empty circles represent models with a
distinct loss function. Filled circles represent information from the environment. Diamonds repre-
sent fixed functions, both deterministic and stochastic. Solid lines represent the flow of information,
while dotted lines represent the flow of gradients used by another model. Paths which are analo-
gous between the two models are highlighted in red. The dependence of Q on future states and the
dependence of future states on π are omitted for clarity.
2 Algorithms
Both GANs and AC can be seen as bilevel or two-time-scale optimization problems, where one model is optimized with respect to the optimum of another model.
64. Can the two sides borrow each other's techniques?
¤ The author, Pfau, recommends Unrolled GAN [Metz+ 17].
¤ He is in fact one of the proposers of Unrolled GAN.
Method GANs AC
Freezing learning yes yes
Label smoothing yes no
Historical averaging yes no
Minibatch discrimination yes no
Batch normalization yes yes
Target networks n/a yes
Replay buffers no yes
Entropy regularization no yes
Compatibility no yes
Table 1: Summary of different approaches used to stabilize and improve training for GANs and AC
methods. Those approaches that have been shown to improve performance are in green, those that
have not yet been demonstrated to improve training are in yellow, and those that are not applicable
to the particular method are in red.
Let the reward from the environment be 1 if the environment chose the real image and 0 if not. This
MDP is stateless as the image generated by the actor does not affect future data.
An actor-critic architecture learning in this environment clearly closely resembles the GAN game.
A few adjustments have to be made to make it identical.
65. SeqGAN
¤ SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient [Yu+ 17]
¤ A GAN that generates sequence data.
¤ Learning sequence data is difficult:
¤ Sequence data are discrete.
¤ The discriminator can only evaluate complete sequences, so evaluating partially generated sequences is hard (we want to guide 𝐺 so that the final score is maximized).
¤ Solution: use reinforcement learning.
¤ Treat 𝐺 as the policy and 𝐷 as the reward.
¤ Derive a value function from 𝐷.
where Y^n_{1:t} = (y_1, . . . , y_t) and Y^n_{t+1:T} is sampled based on the roll-out policy G_β and the current state. In our experiment, G_β is set the same as the generator, but one can use a simplified version if speed is the priority (Silver et al. 2016). To reduce the variance and get a more accurate assessment of the action value, we run the roll-out policy starting from the current state till the end of the sequence N times to get a batch of output samples. Thus, we have:

Q^{G_θ}_{D_φ}(s = Y_{1:t−1}, a = y_t) =
  (1/N) Σ_{n=1}^{N} D_φ(Y^n_{1:T}),  Y^n_{1:T} ∈ MC^{G_β}(Y_{1:t}; N)   for t < T
  D_φ(Y_{1:t})                                                           for t = T,   (4)

where we see that when there is no intermediate reward, the function is iteratively defined as the next-state value starting from state s′ = Y_{1:t} and rolling out to the end.
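A minimal sketch of the Monte Carlo rollout estimate in Eq. (4) (my own version, not the SeqGAN reference code): complete a partial sequence N times with a roll-out policy and average the discriminator scores of the completions. `rollout_policy(prefix, T)` and `discriminator(seq)` are assumed callables.

```python
def mc_action_value(prefix, T, N, rollout_policy, discriminator):
    """Estimate Q(s = prefix[:-1], a = prefix[-1]) for a partially generated sequence."""
    if len(prefix) == T:                      # finished sequence: score it directly
        return discriminator(prefix)
    total = 0.0
    for _ in range(N):                        # N Monte Carlo completions
        full_seq = rollout_policy(prefix, T)  # roll out from the current state to length T
        total += discriminator(full_seq)
    return total / N
```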
A benefit of using the discriminator D_φ as a reward function is that it can be dynamically updated to further improve the generative model iteratively. Once we have a set of more realistic generated sequences, we shall re-train the discriminator.
Figure 1: The illustration of SeqGAN. Left: D is trained over
the real data and the generated data by G. Right: G is trained
by policy gradient where the final reward signal is provided
by D and is passed back to the intermediate action value via
Monte Carlo search.
66. SeqGAN: pseudocode
Algorithm 1 Sequence Generative Adversarial Nets
Require: generator policy Gθ; roll-out policy Gβ; discriminator
Dφ; a sequence dataset S = {X1:T }
1: Initialize Gθ, Dφ with random weights θ, φ.
2: Pre-train Gθ using MLE on S
3: β ← θ
4: Generate negative samples using Gθ for training Dφ
5: Pre-train Dφ via minimizing the cross entropy
6: repeat
7: for g-steps do
8: Generate a sequence Y1:T = (y1, . . . , yT ) ∼Gθ
9: for t in 1 :T do
10: Compute Q(a = yt;s = Y1:t−1) by Eq. (4)
11: end for
12: Update generator parameters via policy gradient Eq. (8)
13: end for
14: for d-steps do
15: Use current Gθ to generate negative examples and com-
bine with given positive examples S
16: Train discriminator Dφ for k epochs by Eq. (5)
17: end for
18: β ← θ
19: until SeqGAN converges
In summary, Algorithm 1 shows the full details of the proposed SeqGAN. At the beginning of training, we use maximum likelihood estimation (MLE) to pre-train Gθ on the training set S.
Long Short-Term Memory (LSTM) cells (Hochreiter & Schmidhuber 1997) are used to implement the generator's update function. It is worth noticing that most RNN variants, such as the gated recurrent unit (GRU) (Cho et al. 2014) and the attention mechanism (Bahdanau, Cho, and Bengio 2014), can be used as a generator in SeqGAN.
The Discriminative Model for Sequences. Deep discriminative models such as the deep neural network (DNN) (Veselý et al. 2013), the convolutional neural network (CNN) (Kim 2014) and the recurrent convolutional neural network (RCNN) (Lai et al. 2015) have shown strong performance in complicated sequence classification tasks. In this paper, we choose the CNN as our discriminator, as it has recently been shown to be highly effective in text (token sequence) classification (Zhang and LeCun 2015). Most discriminative models can only perform classification for an entire sequence rather than an unfinished one; in this paper, we also focus on the situation where the discriminator predicts the probability that a finished sequence is real. We first represent an input sequence x_1, . . . , x_T as E_{1:T} = x_1 ⊕ x_2 ⊕ . . . ⊕ x_T, where x_t ∈ R^k is the k-dimensional token embedding and ⊕ is the concatenation operator building the matrix E_{1:T} ∈ R^{T×k}. Then a kernel w ∈ R^{l×k} applies a convolution operation to a window of l words.
68. Inverse reinforcement learning
¤ Inverse reinforcement learning (IRL)
¤ A method for estimating the reward function from target behavior (an optimal policy).
¤ Maximum entropy IRL (MaxEnt IRL) [Ng+ 00]
¤ Trajectories 𝜏 are assumed to follow a Boltzmann distribution.
¤ The optimal trajectories are assumed to have the highest likelihood -> maximize the likelihood.
¤ But what do we do about the partition function 𝑍...?
Energy-based models define a Boltzmann distribution:

p_θ(x) = (1/Z) exp(−E_θ(x))   (1)

The energy function parameters θ are often chosen to maximize the likelihood of the data; the main challenge in this optimization is evaluating the partition function Z, which is an intractable sum or integral for most high-dimensional problems. A common approach to estimating Z requires sampling from the Boltzmann distribution p_θ(x) within the inner loop of learning.

Sampling from p_θ(x) can be approximated by using Markov chain Monte Carlo (MCMC) methods; however, these methods face issues when there are several distinct modes of the distribution and, as a result, can take arbitrarily large amounts of time to produce a diverse set of samples. Approximate inference methods can also be used during training, though the energy function may incorrectly assign low energy to some modes if the approximate inference method cannot find them [14].

2.3 Inverse Reinforcement Learning
The goal of inverse reinforcement learning is to infer the cost function underlying demonstrated behavior [15]. It is typically assumed that the demonstrations come from an expert who is behaving near-optimally under some unknown cost. In this section, we discuss MaxEnt IRL and guided cost learning, an algorithm for MaxEnt IRL.

2.3.1 Maximum entropy inverse reinforcement learning
Maximum entropy inverse reinforcement learning models the demonstrations using a Boltzmann distribution, where the energy is given by the cost function c_θ:

p_θ(τ) = (1/Z) exp(−c_θ(τ)),

Here, τ = {x_1, u_1, . . . , x_T, u_T} is a trajectory; c_θ(τ) = Σ_t c_θ(x_t, u_t) is a learned cost function parametrized by θ; x_t and u_t are the state and action at time step t; and the partition function Z is the integral of exp(−c_θ(τ)) over all trajectories that are consistent with the environment dynamics.

Under this model, the optimal trajectories have the highest likelihood, and the expert can generate suboptimal trajectories with a probability that decreases exponentially as the trajectories become more costly. As in other energy-based models, the parameters θ are optimized to maximize the likelihood of the demonstrations. Estimating the partition function Z is difficult for large or continuous domains, and presents the main computational challenge. The first applications of this model computed Z exactly with dynamic programming [27]. However, this is only practical in small, discrete domains, and is impossible in domains where the system dynamics p(x_{t+1}|x_t, u_t) are unknown.

2.3.2 Guided cost learning
Guided cost learning introduces an iterative sample-based method for estimating Z in the MaxEnt IRL formulation.
Cost function (reward function)
https://jangirrishabh.github.io/2016/07/09/virtual-car-IRL/
69. Inverse reinforcement learning
¤ Guided cost learning (GCL) [Finn+ 16]
¤ To estimate the partition function, a proposal distribution 𝑞(𝜏) is introduced and the partition function is estimated by importance sampling.
¤ If the proposal distribution fails to cover high-likelihood trajectories, the importance sampling estimate can have high variance, so the demonstration distribution 𝑝 is also used.
¤ 𝑞 is obtained by minimizing the following cost (details omitted).
-> We can then optimize alternately over 𝜃 and 𝑞.
2.3.2 Guided cost learning
Guided cost learning introduces an iterative sample-based method for estimating Z in the MaxEnt IRL formulation, and can scale to high-dimensional state and action spaces and nonlinear cost functions [7]. The algorithm estimates Z by training a new sampling distribution q(τ) and using importance sampling:

L_cost(θ) = E_{τ∼p}[−log p_θ(τ)] = E_{τ∼p}[c_θ(τ)] + log Z
          = E_{τ∼p}[c_θ(τ)] + log E_{τ∼q}[ exp(−c_θ(τ)) / q(τ) ].

Guided cost learning alternates between optimizing c_θ using this estimate, and optimizing q(τ) to minimize the variance of the importance sampling estimate.

Footnote 2: This formula assumes that x_{t+1} is a deterministic function of the previous history. A more general form of this equation can be derived for stochastic dynamics [26]. However, the analysis largely remains the same: the probability of a trajectory can be written as the product of conditional probabilities, but the conditional probabilities of the states x_t are not affected by θ and so factor out of all likelihood ratios.
The optimal importance sampling distribution for estimating the partition function ∫ exp(−c_θ(τ)) dτ is q(τ) ∝ |exp(−c_θ(τ))| = exp(−c_θ(τ)). During guided cost learning, the sampling policy q(τ) is updated to match this distribution by minimizing the KL divergence between q(τ) and (1/Z) exp(−c_θ(τ)), or equivalently minimizing the learned cost and maximizing entropy:

L_sampler(q) = E_{τ∼q}[c_θ(τ)] + E_{τ∼q}[log q(τ)]   (2)

Conveniently, this optimal sampling distribution is the demonstration distribution for the true cost function. Thus, this training procedure results in both a learned cost function, characterizing the demonstration distribution, and a learned policy q(τ), capable of generating samples from the demonstration distribution.

This importance sampling estimate can have very high variance if the sampling distribution q fails to cover some trajectories τ with high values of exp(−c_θ(τ)). Since the demonstrations will have low cost (as a result of the IRL objective), we can address this coverage problem by mixing the demonstration data samples with the generated samples. Let μ = (1/2)p + (1/2)q be the mixture distribution over trajectory roll-outs. Let p̃(τ) be a rough estimate for the density of the demonstrations; for example we could use the current model p_θ, or we could use a simpler density model trained using another method. Guided cost learning uses μ for importance sampling, with (1/2)p̃(τ) + (1/2)q(τ) as the importance weights:

L_cost(θ) = E_{τ∼p}[c_θ(τ)] + log E_{τ∼μ}[ exp(−c_θ(τ)) / ( (1/2)p̃(τ) + (1/2)q(τ) ) ].
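A minimal numerical sketch of the importance-sampling estimate of log Z above, using the mixture μ = (1/2)p + (1/2)q as the proposal (my own toy version, not the GCL implementation from [Finn+ 16]); `cost`, `log_p_tilde` and `log_q` are assumed callables over a batch of trajectories.

```python
import numpy as np
from scipy.special import logsumexp

def log_Z_estimate(taus_from_mu, cost, log_p_tilde, log_q):
    """log Z ~= log E_{tau~mu}[ exp(-c(tau)) / (0.5*p~(tau) + 0.5*q(tau)) ]."""
    c = cost(taus_from_mu)                                    # shape (n,)
    log_mix = np.logaddexp(log_p_tilde(taus_from_mu), log_q(taus_from_mu)) - np.log(2.0)
    log_weights = -c - log_mix                                # log [exp(-c) / mu]
    return logsumexp(log_weights) - np.log(len(log_weights))  # log of the sample mean

# L_cost(theta) would then be estimated as mean(cost(demos)) + log_Z_estimate(...),
# alternating with updates of q that minimize E_q[c(tau)] + E_q[log q(tau)].
```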
2.4 Direct Maximum Likelihood and Behavioral Cloning
A simple approach to imitation learning and generative modeling is to train a generator or policy to output a distribution over the data, without learning a discriminator or energy function. For tractability, the data distribution is typically factorized using a directed graphical model or Bayesian network. In the field of generative modeling, this approach has most commonly been applied to speech and language generation tasks [23, 18], but has also been applied to image generation [22]. Like most EBMs, these models are trained by maximizing the likelihood of the observed data points. When a generative model does not have the capacity to represent the entire data distribution, maximizing likelihood directly will lead to a moment-matching distribution that tries to "cover" all of the modes, leading to a solution that puts much of its mass in parts of the space that have negligible probability under the true distribution. In many scenarios, it is preferable to instead produce only realistic, highly probable samples, by "filling in" as many modes as possible, at the trade-off of lower diversity.
𝜇 = (1/2)𝑝 + (1/2)𝑞
70. Reinterpreting GANs
¤ The optimal discriminator of a GAN is:
¤ Here, assume that the density of the model distribution 𝑞 can be evaluated (changing the usual GAN assumption).
¤ Furthermore, replace the true distribution 𝑝 with a distribution parametrized by a cost function.
¤ The GAN discriminator loss then becomes:
¤ The parameters of 𝐷 = the parameters of the cost function.
¤ It is optimal when (1/𝑍) exp(−𝑐_𝜃(𝜏)) = 𝑝(𝜏).
¤ [Recap] The MaxEnt IRL cost, substituting 𝑝̃(𝜏) = (1/𝑍) exp(−𝑐_𝜃(𝜏)), is:
3 GANs and IRL
We now show how generative adversarial modeling has implicitly been applied to the setting of inverse reinforcement learning, where the data-to-be-modeled is a set of expert demonstrations. The derivation requires a particular form of discriminator, which we discuss first in Section 3.1. After making this modification to the discriminator, we obtain an algorithm for IRL, as we show in Section 3.2, where the discriminator involves the learned cost and the generator represents the policy.

3.1 A special form of discriminator
For a fixed generator with a [typically unknown] density q(τ), the optimal discriminator is the following [8]:

D*(τ) = p(τ) / ( p(τ) + q(τ) ),   (3)

where p(τ) is the actual distribution of the data.

In the traditional GAN algorithm, the discriminator is trained to directly output this value. When the generator density q(τ) can be evaluated, the traditional GAN discriminator can be modified to incorporate this density information. Instead of having the discriminator estimate the value of Equation 3 directly, it can be used to estimate p(τ), filling in the value of q(τ) with its known value. In this case, the new form of the discriminator D_θ with parameters θ is

D_θ(τ) = p̃_θ(τ) / ( p̃_θ(τ) + q(τ) ).

In order to make the connection to MaxEnt IRL, we also replace the estimated data density with the Boltzmann distribution. As in MaxEnt IRL, we write the energy function as c_θ to designate the learned cost. Now the discriminator's output is:

D_θ(τ) = (1/Z) exp(−c_θ(τ)) / ( (1/Z) exp(−c_θ(τ)) + q(τ) ).

The resulting architecture for the discriminator is very similar to a typical model for binary classification, with a sigmoid as the final layer and log Z as the bias of the sigmoid. We have adjusted the architecture only by subtracting log q(τ) from the input to the sigmoid. This modest change allows the optimal discriminator to be completely independent of the generator: the discriminator is optimal when (1/Z) exp(−c_θ(τ)) = p(τ). Independence between the generator and the optimal discriminator may significantly improve the stability of training.

This change is very simple to implement and is applicable in any setting where the density q(τ) can be cheaply evaluated. Of course this is precisely the case where we could directly maximize likelihood, and we might wonder whether it is worth the additional complexity of GAN training. But the experience of researchers in IRL has shown that maximizing log likelihood directly is not always the most effective way to learn complex behaviors, even when it is possible to implement. As we will show, there is a precise equivalence between MaxEnt IRL and this type of GAN, suggesting that the same phenomenon may occur in other domains: GAN training may provide advantages even when it would be possible to maximize likelihood directly.
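A minimal sketch of the discriminator form above (not code from the paper): since D_θ(τ) = (1/Z)exp(−c_θ(τ)) / ((1/Z)exp(−c_θ(τ)) + q(τ)) = sigmoid(−c_θ(τ) − log q(τ) − log Z), it is an ordinary logistic classifier whose input has log q(τ) subtracted and whose bias is −log Z. `cost` and `log_q` are assumed callables and `log_Z` a learned scalar.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(tau, cost, log_q, log_Z):
    logit = -cost(tau) - log_q(tau) - log_Z
    return sigmoid(logit)

# Sanity check of the algebra on a scalar example:
c, lq, lZ = 1.3, -0.7, 0.4
direct = np.exp(-c - lZ) / (np.exp(-c - lZ) + np.exp(lq))
assert np.isclose(direct, sigmoid(-c - lq - lZ))
```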
3.2 Equivalence between generative adversarial networks and guided cost learning
In this section, we show that GANs, when applied to IRL problems, optimize the same objective as
MaxEnt IRL, and in fact the variant of GANs described in the previous section is precisely equivalent
to guided cost learning.
Recall that the discriminator's loss is equal to

L_discriminator(D_θ) = E_{τ∼p}[−log D_θ(τ)] + E_{τ∼q}[−log(1 − D_θ(τ))]
  = E_{τ∼p}[ −log ( (1/Z) exp(−c_θ(τ)) / ( (1/Z) exp(−c_θ(τ)) + q(τ) ) ) ]
  + E_{τ∼q}[ −log ( q(τ) / ( (1/Z) exp(−c_θ(τ)) + q(τ) ) ) ].

In maximum entropy IRL, the log-likelihood objective is:

L_cost(θ) = E_{τ∼p}[c_θ(τ)] + log E_{τ∼(1/2)p+(1/2)q}[ exp(−c_θ(τ)) / ( (1/2)p̃(τ) + (1/2)q(τ) ) ]   (4)
          = E_{τ∼p}[c_θ(τ)] + log E_{τ∼μ}[ exp(−c_θ(τ)) / ( (1/(2Z)) exp(−c_θ(τ)) + (1/2)q(τ) ) ],   (5)

where we have substituted p̃(τ) = p̃_θ(τ) = (1/Z) exp(−c_θ(τ)), i.e. we are using the current model to estimate the importance weights.

We will establish the following facts, which together imply that GANs optimize precisely the MaxEnt IRL problem:
1. The value of Z which minimizes the discriminator's loss is an importance-sampling estimator for the partition function, as described in Section 2.3.2.
2. For this value of Z, the derivative of the discriminator's loss with respect to θ is equal to the derivative of the MaxEnt IRL objective.
3. The generator's loss is exactly equal to the cost c_θ minus the entropy of q(τ).
71. The relationship between GANs and IRL
¤ The following results have been shown [Finn+ 17].
¤ The Z that minimizes the discriminator's objective corresponds to an importance-sampling estimate of the partition function.
¤ For this Z, the gradient of the discriminator's objective equals the gradient of the MaxEnt IRL objective.
¤ The generator's objective equals the objective of the MaxEnt IRL proposal (sampler) distribution.
Recall that µ is the mixture distribution between p and q. Write $\mu(\tau) = \frac{1}{2Z}\exp(-c_\theta(\tau)) + \frac{1}{2}q(\tau)$. Note that when θ and Z are optimized, $\frac{1}{Z}\exp(-c_\theta(\tau))$ is an estimate for the density of p(τ), and hence µ(τ) is an estimate for the density of µ.
3.2.1 Z estimates the partition function
We can compute the discriminator’s loss:
\[ \mathcal{L}_{\mathrm{discriminator}}(D_\theta) = \mathbb{E}_{\tau\sim p}\left[-\log \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\mu(\tau)}\right] + \mathbb{E}_{\tau\sim q}\left[-\log \frac{q(\tau)}{\mu(\tau)}\right] \qquad (6) \]
\[ = \log Z + \mathbb{E}_{\tau\sim p}[c_\theta(\tau)] + \mathbb{E}_{\tau\sim p}[\log \mu(\tau)] - \mathbb{E}_{\tau\sim q}[\log q(\tau)] + \mathbb{E}_{\tau\sim q}[\log \mu(\tau)] \qquad (7) \]
\[ = \log Z + \mathbb{E}_{\tau\sim p}[c_\theta(\tau)] - \mathbb{E}_{\tau\sim q}[\log q(\tau)] + 2\,\mathbb{E}_{\tau\sim \mu}[\log \mu(\tau)]. \qquad (8) \]
Only the first and last terms depend on Z. At the minimizing value of Z, the derivative of these terms with respect to Z will be zero:
\[ \partial_Z \mathcal{L}_{\mathrm{discriminator}}(D_\theta) = 0 \]
\[ \frac{1}{Z} = \mathbb{E}_{\tau\sim \mu}\left[\frac{\frac{1}{Z^2}\exp(-c_\theta(\tau))}{\mu(\tau)}\right] \]
\[ Z = \mathbb{E}_{\tau\sim \mu}\left[\frac{\exp(-c_\theta(\tau))}{\mu(\tau)}\right]. \]
Thus the minimizing Z is precisely the importance sampling estimate of the partition function in
Equation 4.
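A toy numerical check of this claim (my own construction, not an experiment from the paper): on a small finite trajectory space where the data density really is Boltzmann in $c_\theta$, the Z that minimizes the discriminator loss coincides with the true partition function $\sum_\tau \exp(-c_\theta(\tau))$.

import numpy as np

rng = np.random.default_rng(0)
costs = rng.normal(size=50)                   # c_theta(tau) on 50 toy "trajectories"
q = rng.dirichlet(np.ones(50))                # generator density q(tau)
p = np.exp(-costs) / np.exp(-costs).sum()     # data density, assumed Boltzmann here
Z_true = np.exp(-costs).sum()

def disc_loss(Z):
    # Exact discriminator loss on the finite space, viewed as a function of Z only.
    p_tilde = np.exp(-costs) / Z              # (1/Z) exp(-c_theta(tau))
    D = p_tilde / (p_tilde + q)
    return -(p * np.log(D)).sum() - (q * np.log1p(-D)).sum()

grid = np.linspace(0.5 * Z_true, 1.5 * Z_true, 4001)
Z_best = grid[np.argmin([disc_loss(Z) for Z in grid])]
print(Z_best, Z_true)                         # the two agree up to the grid resolution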
3.2.2 cθ optimizes the IRL objective
We return to the discriminator’s loss as computed in Equation 8, and consider the derivative with
respect to the parameters θ. We will show that this is exactly the same as the derivative of the IRL
objective.
Only the second and fourth terms in the sum depend on θ. When we differentiate those terms we
obtain:
\[ \partial_\theta \mathcal{L}_{\mathrm{discriminator}}(D_\theta) = \mathbb{E}_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - \mathbb{E}_{\tau\sim \mu}\left[\frac{\frac{1}{Z}\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\mu(\tau)}\right]. \]
On the other hand, when we differentiate the MaxEnt IRL objective, we obtain:
\[ \partial_\theta \mathcal{L}_{\mathrm{cost}}(\theta) = \mathbb{E}_{\tau\sim p}[\partial_\theta c_\theta(\tau)] + \partial_\theta \log \mathbb{E}_{\tau\sim \mu}\left[\frac{\exp(-c_\theta(\tau))}{\mu(\tau)}\right] \]
\[ = \mathbb{E}_{\tau\sim p}[\partial_\theta c_\theta(\tau)] + \frac{\mathbb{E}_{\tau\sim \mu}\left[\frac{-\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\mu(\tau)}\right]}{\mathbb{E}_{\tau\sim \mu}\left[\frac{\exp(-c_\theta(\tau))}{\mu(\tau)}\right]} \]
\[ = \mathbb{E}_{\tau\sim p}[\partial_\theta c_\theta(\tau)] - \mathbb{E}_{\tau\sim \mu}\left[\frac{\frac{1}{Z}\exp(-c_\theta(\tau))\,\partial_\theta c_\theta(\tau)}{\mu(\tau)}\right] \]
\[ = \partial_\theta \mathcal{L}_{\mathrm{discriminator}}(D_\theta). \]
In the third equality, we used the definition of Z as an importance sampling estimate. Note that in the second equality, we have treated µ(τ) as a constant rather than as a quantity that depends on θ. This is because the IRL optimization is minimizing $\log Z = \log \sum_\tau \exp(-c_\theta(\tau))$ and using µ(τ) as the weights for an importance sampling estimator of Z. For this purpose we do not want to differentiate through the importance weights.
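The shared gradient can be written down directly; the sketch below (my own, with assumed array names) computes $\mathbb{E}_p[\partial_\theta c_\theta] - \mathbb{E}_\mu[\frac{1}{Z}\exp(-c_\theta)\,\partial_\theta c_\theta / \mu]$, holding the importance weights fixed exactly as described above. demo_grads and mix_grads are per-sample gradients $\partial_\theta c_\theta(\tau)$ stacked into arrays; mix_log_mu is log µ(τ) on the mixture samples.

import numpy as np

def shared_cost_gradient(demo_grads, mix_costs, mix_grads, mix_log_mu, Z):
    # Importance weights w(tau) = (1/Z) exp(-c_theta(tau)) / mu(tau),
    # treated as constants (no differentiation through them).
    w = np.exp(-mix_costs - mix_log_mu) / Z
    # E_p[d c/d theta] - E_mu[w(tau) * d c/d theta]
    return demo_grads.mean(axis=0) - (w[:, None] * mix_grads).mean(axis=0)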
3.3 The generator optimizes the MaxEnt IRL objective
Finally, we compute the generator’s loss:
\[ \mathcal{L}_{\mathrm{generator}}(q) = \mathbb{E}_{\tau\sim q}\left[\log(1 - D(\tau)) - \log D(\tau)\right] \]
\[ = \mathbb{E}_{\tau\sim q}\left[\log \frac{q(\tau)}{\mu(\tau)} - \log \frac{\frac{1}{Z}\exp(-c_\theta(\tau))}{\mu(\tau)}\right] \]
\[ = \mathbb{E}_{\tau\sim q}\left[\log q(\tau) + \log Z + c_\theta(\tau)\right] \]
\[ = \log Z + \mathbb{E}_{\tau\sim q}[c_\theta(\tau)] + \mathbb{E}_{\tau\sim q}[\log q(\tau)] = \log Z + \mathcal{L}_{\mathrm{sampler}}(q). \]
Since the term $\log Z$ is a parameter of the discriminator that is held fixed while optimizing the generator, this loss is exactly equivalent to the sampler loss from MaxEnt IRL, defined in Equation 2.
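In sample-based form this is just the entropy-regularized expected cost plus a constant; a minimal sketch with assumed inputs (costs $c_\theta$ and log-densities log q evaluated on trajectories sampled from the generator):

import numpy as np

def generator_loss(sample_costs, sample_log_q, log_Z):
    # L_generator(q) = log Z + E_q[c_theta(tau)] + E_q[log q(tau)]
    #                = log Z + L_sampler(q); log Z is constant w.r.t. the generator.
    return log_Z + np.mean(sample_costs + sample_log_q)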
3.4 Discussion
There are many apparent differences between MaxEnt IRL and the GAN optimization problem. But we have shown that, after making a single key change (using a generator q(τ) for which densities can be cheaply evaluated), GANs with the discriminator of Section 3.1 optimize precisely the MaxEnt IRL objective.
74. As LeCun puts it
Y LeCun
How Much Information Does the Machine Need to Predict?
“Pure” Reinforcement Learning (cherry)
The machine predicts a scalar
reward given once in a while.
A few bits for some samples
Supervised Learning (icing)
The machine predicts a category
or a few numbers for each input
Predicting human-supplied data
10→10,000 bits per sample
Unsupervised/Predictive Learning (cake)
The machine predicts any part of
its input for any observed part.
Predicts future frames in videos
Millions of bits per sample
(Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up)