Trends in Unsupervised
Image Feature Representation Learning
Tomoyuki Suzuki (@tomoyukun)
1
http://hirokatsukataoka.net/project/cc/index_cvpaperchallenge.html
CVPR 2018 Complete Reading Challenge report meeting, cvpaper.challenge study group
@Wantedly Shirokanedai office
Handout version
{Un, Self} supervised representation learning
cvpaper.challenge 2
n Tomoyuki Suzuki
➤ Twitter: @tomoyukun
➤ Affiliation: Keio University, 2nd-year master's student
- Aoki Laboratory
- Research assistant at AIST (2017/5~)
- cvpaper.challenge (2017/5~)
➤ Research interests
- Action recognition, representation learning, etc.
➤ International publications
- Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DB,
CVPR 2018.
- Learning Spatiotemporal 3D Convolution with Video Order Self-supervision,
ECCVWS 2018.
- Semantic Change Detection, ICARCV 2018.
About me
cvpaper.challenge 3
n What is unsupervised feature representation learning?
➤ Definition
➤ Evaluation methods
➤ Broad categories of approaches
n Paper overview
➤ ~ CVPR 2017
➤ ~ CVPR 2018
➤ Even more recent trends
n Summary
n Appendix
➤ Mutual information maximization
Today's contents
This talk enumerates the studies one by one.
cvpaper.challenge 4
n Here, a "good" feature representation = discriminative
- Acquire feature representations of the data that are effective for the task we actually want to solve
(the target task) by first solving a pseudo task (pretext task)
- Other notions of goodness, such as disentanglement, are not considered
n Self-supervised
- The pretext task is defined with supervision signals that can be generated automatically
- Images, video, language, multimodal
n Other than self-supervised (unsupervised)
- Learn a model that represents the data distribution (no supervision)
What is unsupervised feature representation learning?
Acquiring good feature representations of data that have no labels
(exactly as the name says)
This talk covers unsupervised feature representation learning using only CNNs and images.
cvpaper.challenge 5
n Evaluation method 1: feature extraction + classifier
➤ Use the model trained on the pretext task as a frozen feature extractor,
and measure how well its features perform on the target task
➤ Usually evaluated within the same dataset
- Pretext: ImageNet without labels => Target: ImageNet with labels
➤ Evaluating with AlexNet has become the (de facto) standard
How do we evaluate whether a representation is good?
[Diagram: a model (e.g. AlexNet) is trained on the pretext task using image data without labels (e.g. ImageNet);
its weights are then frozen, and a classifier is trained on top for the target task using image data and labels
(e.g. ImageNet classification).]
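A minimal sketch of this linear-evaluation protocol in PyTorch; the backbone, data loader and feature dimension are placeholders, not details from the slides.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10):
    """Freeze a pretext-trained backbone and train only a linear classifier on the target task."""
    for p in backbone.parameters():
        p.requires_grad = False          # frozen feature extractor
    backbone.eval()

    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=0.01, momentum=0.9)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = backbone(images)     # fixed features
            loss = ce(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```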
cvpaper.challenge 6
n Evaluation method 2: fine-tuning
➤ Use the parameters learned on the pretext task as initialization, and measure performance
after fine-tuning on the target task
➤ Usually evaluated across different datasets
- Pretext: ImageNet without labels => Target: Pascal VOC with labels
➤ As with evaluation method 1, evaluating with AlexNet is the standard
How do we evaluate whether a representation is good?
[Diagram: a model (e.g. AlexNet) is trained on the pretext task using image data without labels (e.g. ImageNet),
then the whole model is fine-tuned on the target task using image data and labels (e.g. Pascal VOC segmentation).]
In this talk, ImageNet without labels => Pascal VOC* is the reference benchmark
(e.g. Pascal VOC segmentation)
* classification: %mAP, detection: %mAP, segmentation: %mIoU
cvpaper.challenge 7
Taxonomy of pretext tasks
n Broad categorization of work up to CVPR 2018 (following [Noroozi+, ICCV17])
n Note that this grouping is only for convenience
➤ Many methods are idea-driven, which makes them hard to classify
[Taxonomy diagram:]
- Discriminative: Context prediction, Jigsaw, Jigsaw++, Rotation
- Reconstruction-based: Autoencoder-style, Context Encoder, Colorization, Split-brain, Spot Artifact
- Generative-model-based: VAE-style, GAN-style
- Other: Exemplar CNN, Counting, Instance Discrimination, Noise as target
cvpaper.challenge 8
Taxonomy of pretext tasks
[Same taxonomy diagram, with the discriminative group highlighted: Context prediction, Jigsaw, Jigsaw++, Rotation.]
n Discriminative
➤ Define a category t, obtainable automatically, for each unlabeled sample x
- This yields supervised data (x, t)
- t is usually defined according to some transformation φ(·) applied to x
- In that case the supervised data is (φ(x), t)
cvpaper.challenge 9
Taxonomy of pretext tasks
[Same taxonomy diagram, with the reconstruction-based group highlighted: Autoencoder-style, Context Encoder, Colorization, Split-brain, Spot Artifact.]
n Reconstruction-based
➤ Given that only part of x = {x_1, x_2} is observed, estimate x or the unobserved part x_2
- The case where everything is observed is the autoencoder
- Approaches include regression and conditional generative modeling
cvpaper.challenge 10
Taxonomy of pretext tasks
n Generative-model-based
➤ Acquire a representation as a byproduct of learning the data distribution p(x)
- The latent variables of a VAE, intermediate features of a GAN discriminator, etc.
- (Personally) this should give the best representation if it could be learned well
- However, learning p(x) is hard (lower-bound maximization, minimax problems)
[Same taxonomy diagram, with the generative group highlighted: VAE-style, GAN-style.]
These are generally not called self-supervised.
~ CVPR 2017
cvpaper.challenge 12
Taxonomy of pretext tasks
[Taxonomy diagram restricted to the methods covered up to CVPR 2017: Context prediction, Jigsaw (discriminative); AutoEncoder-style, Context Encoder, Colorization, Split-brain (reconstruction); VAE-style, GAN-style (generative); Exemplar CNN (other).]
n Broad categorization of work up to CVPR 2018 (following [Noroozi+, ICCV17])
n Note that this grouping is only for convenience
➤ Many methods are idea-driven, which makes them hard to classify
cvpaper.challenge 13
n Exemplar CNN
➤ Pretext task: instance-level image classification robust to (geometric and color) transformations
➤ Since (number of classes = number of training image instances) and classification is done with a plain softmax,
the usable dataset size does not scale well
➤ It actually does something close to Instance Discrimination (described later), already in 2014
➤ Better results than SIFT on tasks such as geometric matching
(Jumping straight to) Other
Dosovitskiy et al., “Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks”, NIPS 2014.
(~ CVPR2017)
[Fig. 2 from the paper: several random transformations applied to one 'seed' patch extracted from the STL unlabeled dataset.
All transformed versions of a single image instance are defined as one class.]
[Table 1 from the paper: classification accuracies of Exemplar-CNN variants vs. prior unsupervised methods on STL-10, CIFAR-10, Caltech-101 and Caltech-256.]
[Fig. 3 from the paper: influence of the number of surrogate classes; STL-10 accuracy increases with the number of surrogate classes
but stops improving, or even degrades, beyond an optimum.]
The number of classes (= number of image instances) hits its limit at around 8000.
cvpaper.challenge 14
n Context Prediction (CP)
➤ Pretext task: split an image into a 3×3 grid and classify the relative position of two patches (8 classes)
- The two patches are fed into a Siamese network whose branches share weights
- The branch CNN is then used as the pretrained model
➤ Fine-tuning results are only slightly better than random initialization
[Figure 3 from the paper: the pair-classification architecture. Two AlexNet-style branches (conv1–fc6) with shared weights
(the SiameseNet) process Patch 1 and Patch 2; fc7–fc9 combine them and feed an 8-way softmax.]
[Figure 2 from the paper: the algorithm receives two patches in one of eight possible spatial arrangements, without any
other context, and must classify which configuration was sampled.]
Fine-tuning on Pascal VOC
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
Discriminative
Doersch et al., “Unsupervised visual representation learning by context prediction”, ICCV 2015.
(~ CVPR2017)
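As an illustration of the pretext-task setup, here is a rough sketch of sampling a (centre patch, neighbour patch, relative-position label) triple. The patch size, gap and neighbour-offset convention are assumptions for illustration, not details taken from the slides, and the image is assumed large enough for the grid.

```python
import random

# Offsets of the 8 neighbours relative to the centre patch; the class label is the
# index of the sampled neighbour (an assumed convention for this sketch).
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     ( 0, -1),          ( 0, 1),
                     ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_context_pair(image, patch=96, gap=8):
    """Return (centre_patch, neighbour_patch, label) for the 8-way relative-position task.
    image: H x W x C array, assumed at least 3*(patch+gap) pixels on each side."""
    h, w = image.shape[:2]
    cell = patch + gap                      # grid cell size including the gap between patches
    top = random.randint(cell, h - 2 * cell)
    left = random.randint(cell, w - 2 * cell)
    label = random.randrange(8)
    dy, dx = NEIGHBOUR_OFFSETS[label]
    centre = image[top:top + patch, left:left + patch]
    ny, nx = top + dy * cell, left + dx * cell
    neighbour = image[ny:ny + patch, nx:nx + patch]
    return centre, neighbour, label
```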
cvpaper.challenge 15
n Jigsaw Puzzle (JP)
➤ Pretext task: feed the patches in a random order and classify which permutation it is
- The Siamese network receives all 9 patches at once
- Since the number of permutations is huge, training uses 1000 classes chosen so that
their Hamming distances are large (see the sketch after this slide)
➤ CP can be quite ambiguous depending on the patches (see figure below)
➤ The more patches the network can see, the less ambiguity there is
➤ Accuracy improves considerably compared with CP
Discriminative
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
JP 67.7 53.2 —
Estimating the relative position of patches ① or ② with respect to ⑤ is quite difficult.
[Figure from the paper: patches extracted from an image (marked with green lines) and the resulting jigsaw puzzle.]
Noroozi et al., “Unsupervised learning of visual representations by solving jigsaw puzzles ”, ECCV 2016.
(~ CVPR2017)
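A hedged sketch of how a permutation set with large pairwise Hamming distances could be chosen greedily; this is a generic max-min heuristic, not necessarily the paper's exact procedure.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def select_permutations(n_classes=100, n_tiles=9, pool=5000, seed=0):
    """Greedily pick a set of tile permutations that are far apart in Hamming distance."""
    rng = random.Random(seed)
    candidates = [tuple(rng.sample(range(n_tiles), n_tiles)) for _ in range(pool)]
    chosen = [candidates.pop()]
    while len(chosen) < n_classes:
        # pick the candidate whose minimum distance to the chosen set is largest
        best = max(candidates, key=lambda p: min(hamming(p, q) for q in chosen))
        candidates.remove(best)
        chosen.append(best)
    return chosen
```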
cvpaper.challenge 16
n Solving the pretext task without high-level information
➤ What we actually want the network to capture, however, is high-level (semantic) information
➤ Could the relative position be predicted only from low-level information at the patch boundaries?
- Insert a gap between patches
- Jitter the patch positions
➤ Could the relative position be predicted from chromatic aberration?
- Randomly replace 2 of the color channels with Gaussian noise
(See the sketch at the end of this slide.)
Trivial solutions
[Text and Figure 2 from the Context Prediction paper: the eight possible spatial arrangements of the two patches.]
For example...
[Figure: an example of chromatic aberration.]
Make the inter-channel "aberration" unavailable during training.
Make it impossible to judge from the patch boundaries or their extrapolation.
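A small sketch of the kind of anti-shortcut preprocessing described above (position jittering and replacing two color channels with Gaussian noise); the numeric values and function names are illustrative only.

```python
import numpy as np

def extract_patch_with_jitter(image, top, left, size=96, jitter=7):
    """Extract a patch whose position is randomly jittered to break exact boundary alignment.
    Caller must ensure the jittered window stays inside the image."""
    dy, dx = np.random.randint(-jitter, jitter + 1, size=2)
    y, x = top + dy, left + dx
    return image[y:y + size, x:x + size]

def drop_color_channels(patch):
    """Replace two randomly chosen channels with Gaussian noise so chromatic aberration cannot be exploited."""
    patch = patch.astype(np.float32).copy()
    keep = np.random.randint(3)                       # the single channel that is kept
    for c in range(3):
        if c != keep:
            patch[..., c] = np.random.normal(patch[..., c].mean(),
                                             patch[..., c].std() + 1e-6,
                                             size=patch.shape[:2])
    return patch
```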
cvpaper.challenge 17
n Context Encoder (CE)
➤ Pretext task: inpainting an image with a missing region
- The paper proposes Adversarial Loss + L2 Loss, but the representation-learning experiments
use only the L2 Loss
- In other words, plain regression
➤ During representation learning the network only ever sees images with missing regions
- Yet the target task feeds it images with nothing missing
Reconstruction-based
Cls. Det. Seg.
random 53.3 43.4 19.8
CE 56.5 44.5 29.7
JP 67.7 53.2 —
[Figure 2 from the paper: the context image is passed through the encoder to obtain features, which are connected
to the decoder using a channel-wise fully-connected layer.]
Pathak et al., “Context encoders: Feature learning by inpainting ”, CVPR 2016.
(~ CVPR2017)
cvpaper.challenge 18
n Colorful Image Colorization (CC)
➤ Pretext task: colorizing a grayscale image {L => ab}
➤ Instead of plain regression, solve a classification problem over a quantized ab space
➤ Since representation learning assumes grayscale input, when color images are handled in the target task,
the input becomes Lab and the ab channels are randomly initialized
n Split-Brain (SB)
➤ Split the network in two along the channel dimension and
ensemble {L => ab, ab => L}
➤ Quantizing and classifying gave better representations
than regression
Reconstruction-based
[Fig. 2 from the Colorization paper: the network architecture; each conv block is 2–3 repeated conv+ReLU layers followed by
BatchNorm, with no pooling layers; all resolution changes happen by spatial down/upsampling between blocks.]
[Figure 2 from the Split-Brain paper: the Lab input is split into the grayscale channel X_L and the color channels X_ab,
and each half of the network predicts the channels it does not see.]
Cls. Det. Seg.
random 53.3 43.4 19.8
CC 65.9 46.9 35.6
SB 67.1 46.7 36.0
JP 67.7 53.2 —
Zhang et al., “Colorful Image Colorization”, ECCV 2016.
Zhang et al., “Split-brain autoencoders: Unsupervised learning by cross-channel prediction”, CVPR 2017.
(~ CVPR2017)
cvpaper.challenge 19
n DCGAN
➤ Pretext task: learning an image generation model
- Proposes techniques, mainly architectural, that enable high-quality generation
- Modeling the data distribution well => the features captured should be good
➤ The discriminator's intermediate outputs are used as the representation
➤ No ImageNet => Pascal VOC experiments
➤ Compared with Exemplar CNN on CIFAR-10
Generative-model-based
on CIFAR-10
Method acc. (%) Num of features
Ex CNN 84.3 1024
DCGAN 82.8 512
Radford et al., “UNSUPERVISED REPRESENTATION LEARNING WITH DEEP
CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS”, ICLR 2016.
(~ CVPR2017)
Since the architectures and the datasets used for representation learning differ,
this cannot be called a fair comparison.
~ CVPR 2018
cvpaper.challenge 21
Taxonomy of pretext tasks
[Full taxonomy diagram: Context prediction, Jigsaw, Jigsaw++, Rotation (discriminative); AutoEncoder-style, Context Encoder, Colorization, Split-brain, Spot Artifact (reconstruction); VAE-style, GAN-style (generative); Counting, Instance Discrimination, Noise as target, Exemplar CNN (other).]
n Broad categorization of work up to CVPR 2018 (following [Noroozi+, ICCV17])
n Note that this grouping is only for convenience
➤ Many methods are idea-driven, which makes them hard to classify
cvpaper.challenge 22
n BiGAN
➤ Unlike a standard GAN, which models only p(x|z) (the Generator), BiGAN also models
the inference of the latent variable p(z|x) (the Encoder)
➤ The Generator's joint distribution p_G(x, z) = p_G(x|z) p(z) and the Encoder's joint distribution
p_E(x, z) = p_E(z|x) p(x) are brought together in the same framework as an ordinary GAN
➤ Better results than a standard GAN that uses D's intermediate outputs as the representation
- D is not a generic discriminator between the data distribution and everything else
Generative-model-based
Cls. Det. Seg.
random 53.3 43.4 19.8
BiGAN 60.3 46.9 35.2
JP 67.7 53.2 —
Donahue et al., “ADVERSARIAL FEATURE LEARNING”, ICLR 2017.
(~ CVPR2017)
[Figure 1 from the paper: the structure of Bidirectional GANs (BiGAN). The generator maps latent samples z to G(z),
the encoder maps data x to E(x), and the discriminator D receives either (G(z), z) or (x, E(x)).]
cvpaper.challenge 23
n Learning to Count (LC)
➤ Pretext task: learn features that satisfy the following constraint
➤ Constraint: feed each tile of a split image and the original image into the same CNN;
the output features of the original image must equal the sum of the output features of all tiles
=> This constraint can be satisfied if each dimension of the output features represents the amount
of some high-level primitive in the image
➤ Personally I find this a really interesting idea (a rough sketch follows this slide)
Other
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP 67.7 53.2 —
[Figure 3 from the paper: average response of the trained network on the ImageNet validation set; despite its sparsity
(30 non-zero entries), the hidden representation transfers well to classification, detection and segmentation.]
[Table from the paper: Pascal VOC classification/detection transfer results for supervised, random and prior
self-supervised methods.]
The features become something like a histogram over primitives.
Noroozi et al., “Representation Learning by Learning to Count”, ICCV 2017.
Same author as the Jigsaw paper
(~ CVPR2018)
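A rough sketch of the counting constraint written as a loss; the contrastive term against a different image (to discourage the trivial all-zero solution) follows the spirit of the paper, but the margin value and function names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def counting_loss(model, image, tiles, other_image, margin=10.0):
    """Counting constraint: features of the full image should equal the sum of its tile features.
    image, other_image: (B, C, H, W); tiles: list of (B, C, h, w) tensors from the split image."""
    f_full = model(image)                                     # (B, D)
    f_tiles = torch.stack([model(t) for t in tiles]).sum(0)   # sum over tile features
    f_other = model(other_image)

    pos = (f_tiles - f_full).pow(2).sum(dim=1)                       # enforce the counting constraint
    neg = F.relu(margin - (f_tiles - f_other).pow(2).sum(dim=1))     # push a different image away
    return (pos + neg).mean()
```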
cvpaper.challenge 24
n Noise as Target (NAT)
➤ Pretext task: assign each image's output one-to-one to uniformly sampled target vectors and pull them together
- We want the assignment that minimizes the total error over all samples
- Exhaustive search is infeasible, so the assignment is approximated per batch with the Hungarian algorithm
➤ At first glance this seems meaningless, but spreading the image features uniformly over the feature space
turns out to be meaningful (see Appendix)
Other
Cls. Det. Seg.
random 53.3 43.4 19.8
NAT 65.3 49.4 36.6
JP 67.7 53.2 —
Bojanowski et al., “Unsupervised Learning by Predicting Noise”, ICML 2017.
[Figure 1 from the paper: images are mapped to deep features, which are assigned one-to-one to fixed target vectors c_j;
one target per data point, sampled (and then fixed) from a uniform distribution.]
With unit-normalized features f_θ(X), the objective is
min_θ min_{Y ∈ R^{n×d}} (1/2n) ||f_θ(X) − Y||_F²
[Figure 3 from the paper: query images and their 3 nearest neighbors in ImageNet under an L2 distance on the learned
features; the features seem to capture global distinctive structures.]
[Figure 4 from the paper: first-layer filters of an AlexNet trained on ImageNet with supervision (left) vs. with NAT
(right), using grayscale gradient images as input.]
(~ CVPR2018)
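A sketch of the per-batch target reassignment using the Hungarian algorithm via SciPy; the function names and the squared-L2 loss form are illustrative.

```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment

def reassign_targets(features, targets):
    """Optimal 1-to-1 assignment of fixed noise targets to features within a batch.
    features, targets: L2-normalized (B, D) tensors; returns targets permuted to match features."""
    cost = torch.cdist(features, targets).detach().cpu().numpy()   # pairwise L2 distances
    row, col = linear_sum_assignment(cost)                         # minimize total assignment cost
    order = torch.as_tensor(col[np.argsort(row)], device=targets.device)
    return targets[order]

def nat_loss(features, targets):
    """Squared L2 loss between features and their (re)assigned targets."""
    assigned = reassign_targets(features, targets)
    return (features - assigned).pow(2).sum(dim=1).mean()
```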
cvpaper.challenge 25
n Instance Discrimination (ID)
➤ Pretext task: classification in which every image instance is its own class
- Since the number of classes is huge, NCE is used in practice
- Minimize the cross entropy obtained when the logits are the inner products between the input image's feature
and each image's feature from the previous iteration
➤ At the optimum, the embedding should scatter each image's feature vector sparsely over the hypersphere
(see Appendix)
=> This should be doing something very close to NAT (which is not cited)
Other
Cls. Det. Seg.
random 53.3 43.4 19.8
ID — 48.1 —
JP 67.7 53.2 —
Wu et al., “Unsupervised Feature Learning via Non-Parametric Instance Discrimination ”, CVPR 2018.
[Figure from the paper: each image is mapped by a CNN backbone to a 2048-D feature, projected to a 128-D L2-normalized
embedding; a non-parametric softmax over a memory bank holding each image's feature from the previous iteration performs
instance-level discrimination.]
(~ CVPR2018)
CVPR2018
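A simplified sketch of the memory-bank instance-discrimination objective, using a full non-parametric softmax rather than the paper's NCE approximation; the temperature and momentum values are illustrative.

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(features, indices, memory_bank, temperature=0.07, momentum=0.5):
    """features: (B, D) L2-normalized embeddings; indices: (B,) image ids;
    memory_bank: (N, D) L2-normalized per-image features from previous iterations."""
    logits = features @ memory_bank.t() / temperature       # similarity to every stored instance
    loss = F.cross_entropy(logits, indices)                  # each image is its own class

    with torch.no_grad():                                    # momentum update of the bank entries
        updated = momentum * memory_bank[indices] + (1 - momentum) * features
        memory_bank[indices] = F.normalize(updated, dim=1)
    return loss
```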
cvpaper.challenge 26
n Spot Artifact (SA)
➤ Pretext task: repairing images corrupted in feature-map space
- Adversarial training between repair layers, which fill in the dropped entries, and a discriminator
- Uses feature maps of a model pretrained as an autoencoder
- The discriminator is expected to acquire good feature representations
➤ Corrupting the feature maps is expected to corrupt higher-level information
(looking at the actual corrupted images, the artifacts are hard to spot)
Reconstruction-based
Cls. Det. Seg.
random 53.3 43.4 19.8
SA 69.8 52.5 38.1
JP 67.7 53.2 —
[Figure 2 from the paper: two autoencoders {E, D1–D5} output either real images or images with artifacts. Corrupted
images are produced by masking the encoded feature φ(x) and passing it through repair layers R1–R5 inserted between
the decoder layers; a discriminator C is trained adversarially to distinguish real from corrupt and also to localize
the dropped feature entries.]
Wu et al., “Self-Supervised Feature Learning by
Learning to Spot Artifacts ”, CVPR 2018.
Red: corrupt, green: real
(~ CVPR2018)
CVPR2018
cvpaper.challenge 27
n Jigsaw Puzzle++
➤ Pretext task: JP in which 1–3 patches are replaced with patches from another image
- Fewer patches of the original image are visible, and patches from other images must be identified as such
- Both of these raise the difficulty of the pretext task
- Permutations are chosen with Hamming distance in mind so that no configuration belongs to multiple classes
Discriminative
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP++ 69.8 55.5 38.1
JP 67.7 53.2 —
[Figure 3 from the paper: the Jigsaw++ task — (a) the main image, (b) a random image, (c) a puzzle from the original formulation.]
Noroozi et al., “Boosting Self-Supervised Learning via
Knowledge Transfer ”, CVPR 2018.
Same author as Jigsaw and Counting
(~ CVPR2018)
CVPR2018
cvpaper.challenge 28
n Classify Rotation (CR)
➤ Pretext task: predicting image rotation
- 4-class classification over 0°, 90°, 180°, 270°
- Finer-grained rotations would require interpolation after rotating
=> that creates artifacts, which become a source of trivial solutions
➤ Estimating an object's rotation angle requires high-level information about the object
➤ Best accuracy so far (Cls., Det.) & the simplest to implement
Discriminative
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
[Figure 2 from the paper: the same ConvNet model F(.) receives the input image rotated by 0, 90, 180 and 270 degrees
and is trained to predict which of the four rotations was applied.]
Gidaris et al., “Unsupervised Representation Learning by predicting Image Rotation”, ICLR 2018.
(~ CVPR2018)
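The rotation pretext task is simple enough that a near-complete sketch fits in a few lines; this is a generic PyTorch version, not the authors' code.

```python
import torch
import torch.nn.functional as F

def rotation_batch(images):
    """Build the 4-way rotation pretext batch: each image rotated by 0/90/180/270 degrees.
    images: (B, C, H, W) -> returns (4B, C, H, W) rotated images and (4B,) rotation labels."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    labels = torch.arange(4, device=images.device).repeat_interleave(images.size(0))
    return rotated, labels

def rotation_loss(model, images):
    """Cross entropy on predicting which rotation was applied (model outputs 4 logits)."""
    rotated, labels = rotation_batch(images)
    return F.cross_entropy(model(rotated), labels)
```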
cvpaper.challenge 29
n Classify Rotation (CR)
➤ Dependence on the structure of the data
➤ Depending on the image domain, rotation might be predictable from low-level features alone
- Indeed, it does not do as well on the Places scene-classification task
➤ Some images have no well-defined rotation
- Aerial photographs, for example
Discriminative
Gidaris et al., “Unsupervised Representation Learning by predicting Image Rotation”, ICLR 2018.
[Table from the paper: ImageNet top-1 classification with linear layers on frozen conv1–conv5 features; RotNet reaches
18.8 / 31.7 / 38.7 / 38.2 / 36.5 across conv1–conv5, above the other unsupervised methods listed.]

Table 6 from the paper (Task & Dataset Generalization): Places top-1 classification with linear layers. Logistic
regression classifiers are trained on top of the frozen feature maps of each layer for the 205-way Places
classification task; all unsupervised methods are pre-trained on ImageNet without labels.

Method  Conv1  Conv2  Conv3  Conv4  Conv5
Places labels (Zhou et al., 2014)  22.1  35.1  40.2  43.3  44.6
ImageNet labels  22.7  34.8  38.4  39.4  38.7
Random  15.7  20.3  19.8  19.1  17.5
Random rescaled (Krähenbühl et al., 2015)  21.4  26.2  27.1  26.1  24.0
Context (Doersch et al., 2015)  19.7  26.7  31.9  32.7  30.9
Context Encoders (Pathak et al., 2016b)  18.2  23.2  23.4  21.9  18.4
Colorization (Zhang et al., 2016a)  16.0  25.7  29.6  30.3  29.7
Jigsaw Puzzles (Noroozi & Favaro, 2016)  23.0  31.9  35.0  34.2  29.3
BIGAN (Donahue et al., 2016)  22.0  28.7  31.8  31.3  29.7
Split-Brain (Zhang et al., 2016b)  21.3  30.7  34.0  34.1  32.5
Counting (Noroozi et al., 2017)  23.3  33.9  36.3  34.7  29.6
(Ours) RotNet  21.5  31.0  35.1  34.6  33.7

[Figure residue: example Places scene categories (indoor / nature / urban).]
Places
For example, the rotation can be estimated from the position of the sky alone.
(~ CVPR2018)
cvpaper.challenge
Comparison
{Self, Un}-supervised learning on ImageNet => Fine-tuning on Pascal VOC 2007

Method  Conference  Classification (%mAP)  Detection (%mAP)  Segmentation (%mIoU)
Random init.  —  53.3  43.4  19.8
Context prediction  ICCV15  55.3  46.6  —
Context encoder  CVPR16  56.5  44.5  29.7
Colorize  ECCV16  65.9  46.9  35.6
Jigsaw  ECCV16  67.7  53.2  —
Split-Brain  CVPR17  67.1  46.7  36.0
NAT  ICML17  65.3  49.4  36.6
Counting  ICCV17  67.7  51.4  36.6
BiGAN  ICLR17  60.1  46.9  34.9
Rotation  ICLR18  73.0  54.4  39.1
Spot Artifact  CVPR18  69.8  52.5  38.1
Instance Dis.  CVPR18  —  48.1  —
Jigsaw++  CVPR18  69.8  55.5  38.1
Supervised  —  79.9  59.1  48.0
cvpaper.challenge
Comparison
{Self, Un}-supervised learning on ImageNet => Fine-tuning on Pascal VOC 2007

Method  Conference  Classification (%mAP)  Detection (%mAP)  Segmentation (%mIoU)
Random init.  —  53.3  43.4  19.8
Context prediction  ICCV15  55.3  46.6  —
Context encoder  CVPR16  56.5  44.5  29.7
Colorize  ECCV16  65.9  46.9  35.6
Jigsaw  ECCV16  67.7  53.2  —
Split-Brain  CVPR17  67.1  46.7  36.0
NAT  ICML17  65.3  49.4  36.6
Counting  ICCV17  67.7  51.4  36.6
BiGAN  ICLR17  60.1  46.9  34.9
Rotation  ICLR18  73.0  54.4  39.1
Spot Artifact  CVPR18  69.8  52.5  38.1
Instance Dis.  CVPR18  —  48.1  —
Jigsaw++  CVPR18  69.8  55.5  38.1
Deep Cluster  ECCV18  73.7  55.4  45.1
Supervised  —  79.9  59.1  48.0
Even more recent trends
cvpaper.challenge 33
n Deep Cluster (DC)
➤ Repeat the following two steps
① k-means clustering on the CNN's intermediate features
② Learn a classification problem using the assigned clusters as pseudo labels
➤ In the first iteration, clustering uses the outputs of a randomly initialized CNN
- Training an MLP on those outputs still reaches 12%
=> the input information is preserved to some extent
➤ In the ImageNet experiments, k = 10000 (> 1000) works best
➤ A simple yet very powerful method
Latest trends
Caron et al., “Deep Clustering for Unsupervised Learning of Visual Features ”, ECCV 2018.
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
DC 73.7 55.4 45.1
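A simplified sketch of one Deep Cluster round (feature extraction, k-means, then training on the pseudo labels). It caches the whole dataset in memory and skips the paper's cluster re-balancing and classifier-head re-initialization, so treat it as a schematic only; names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deep_cluster_round(backbone, classifier, loader, k=10000, lr=0.05):
    """One round: (1) k-means on current features, (2) supervised training on the cluster ids."""
    backbone.eval()
    feats, images_cache = [], []
    with torch.no_grad():                                   # step 1: extract features for the whole dataset
        for images, _ in loader:
            feats.append(backbone(images))
            images_cache.append(images)
    feats = torch.cat(feats).cpu().numpy()
    pseudo = torch.as_tensor(KMeans(n_clusters=k).fit_predict(feats), dtype=torch.long)

    backbone.train()
    params = list(backbone.parameters()) + list(classifier.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    ce = nn.CrossEntropyLoss()
    start = 0
    for images in images_cache:                             # step 2: train on pseudo labels
        labels = pseudo[start:start + images.size(0)]
        start += images.size(0)
        loss = ce(classifier(backbone(images)), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
```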
cvpaper.challenge 34
n Deep Cluster (DC)
➤ Analysis of the training behaviour
[Figure from the paper: (a) clustering quality, (b) cluster reassignment, (c) influence of k.]
- The mutual information between ImageNet labels and the clusters keeps increasing
- The mutual information between epochs increases => the cluster assignments stabilize
Latest trends
Caron et al., “Deep Clustering for Unsupervised Learning of Visual Features ”, ECCV 2018.
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
DC 73.7 55.4 45.1
cvpaper.challenge 35
n Deep InfoMax (DIM)
➤ Train so as to maximize the mutual information I(x; z) between the input x and the feature vector z
- Roughly speaking, make the dependence between x and z large
- In practice, maximizing the mutual information between z and each patch of x is what brings the large gain
➤ Simply attaching a discriminator that classifies (x, z) pairs as positive or negative and training
end-to-end maximizes a lower bound on I(x; z)
➤ Since there is no alternating optimization as in GANs, implementation and training are easy
➤ Not compared against every method, but accuracy is close to supervised learning
Latest trends
Hjelm et al., “Learning deep representations by mutual information estimation and maximization”, arXiv 8/2018.
[Figure 1 from the paper: the base encoder maps an image to an M×M feature map (one vector per input patch), which is
summarized, e.g. by further convolutions and fully-connected layers, into a single global feature vector Y.]
[Figure 2 from the paper: Deep InfoMax with a global MI(X; Y) objective; a discriminator scores the global vector Y
paired with the M×M feature map of the same image (real) or of another image (fake).]

Table 2 from the paper: classification accuracy (top 1) on Tiny ImageNet and STL.
Method  Tiny ImageNet (conv)  Tiny ImageNet (fc 4096)  Tiny ImageNet (Y, 64)  STL (conv)
Fully supervised  36.60  —  —  —
VAE  18.63  16.88  11.93  58.27
AAE  18.04  17.27  11.49  59.54
BiGAN  24.38  20.21  13.06  71.53
NAT  13.70  11.62  1.20  64.32
DIM(G)  11.32  6.34  4.95  42.03
DIM(L)  33.8  34.5  30.7  71.82

On Tiny ImageNet, accuracy is close to the supervised baseline.
cvpaper.challenge 36
n Contrastive Predictive Coding (CPC)
➤ For sequential data, maximize the mutual information between the feature vector c_t at some time step
and a future input x_{t+k}
➤ Here the discriminator solves an N-class classification problem, identifying the one positive pair among
N pairs, which maximizes a lower bound on the mutual information
➤ For images, the feature map is treated as a top-to-bottom sequence, as in the figure
➤ Not compared against every method, but by far the best accuracy within its experiments
Latest trends
Oord et al., “Representation Learning with Contrastive Predictive Coding”, arxiv 6/2018.
[Figure 4 from the paper: CPC for images; a 256 px image is divided into 64 px crops with 50% overlap, each crop is
encoded by g_enc, and an autoregressive model g_ar summarizes the crops above into a context c_t that predicts the
feature vectors z_{t+2}, z_{t+3}, z_{t+4} of the crops below.]

Table 3 from the paper: ImageNet top-1 unsupervised classification results.
Method  Top-1 ACC
Using AlexNet conv5:
Video [27]  29.8
Relative Position [11]  30.4
BiGan [34]  34.8
Colorization [10]  35.2
Jigsaw [28]*  38.1
Using ResNet-V2:
Motion Segmentation [35]  27.6
Exemplar [35]  31.5
Relative Position [35]  36.2
Colorization [35]  39.6
CPC  48.7
*Jigsaw is not directly comparable to the other AlexNet results because of architectural differences.
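A minimal InfoNCE-style sketch of the N-way classification that lower-bounds the mutual information; it uses a plain dot-product score instead of the paper's bilinear score, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context, future, temperature=1.0):
    """context: (N, D) context vectors c_t; future: (N, D) encodings of the true future inputs.
    Each row's matching pair is the positive; the other N-1 rows in the batch act as negatives."""
    logits = context @ future.t() / temperature        # (N, N) scores; diagonal entries are positives
    labels = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, labels)
```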
Summary
cvpaper.challenge 38
n Up to CVPR 2018
➤ A wide variety of idea-driven methods have been published (with presumably many more that were shelved)
➤ Self-supervised learning that exploits the structure of image data dominated (Rotation, Jigsaw, ...)
n Current movement
➤ Methods that do not depend on the data structure have started to work well (Deep Cluster,
approaches based on mutual information)
➤ Methods that depend on the data structure succeed or fail depending on the image domain
(see the Rotation-on-Places result)
n Future outlook
➤ Methodological outlook
- Structure-independent methods will develop further (hard to imagine the specifics)
➤ Outlook for the research area
- Beating supervised learning (surpassing ImageNet pretraining)
- Task-specific unsupervised learning (which already exists, of course)
The latter seems to fit better with self-supervised learning that exploits data structure
Summary
cvpaper.challenge 39
n What makes it fun
➤ Being able to learn from data alone, without annotation, is a dream worth chasing
➤ Designing pretext tasks while thinking about the data structure feels (itself) like solving a puzzle
n What makes it painful
➤ You basically cannot tell in advance; whether an idea is good only shows in the experiments
➤ Evaluation requires double tuning (pretext and target)
n As for practical use
➤ The prevailing mood is that an ImageNet pretrained model is good enough as a pretrained model
➤ However, there are cases where ImageNet pretrained models are not effective
- When the image domain differs greatly from ImageNet
➤ In such settings, these methods could find real use
➤ Depending on the conditions, they also compete with semi-supervised learning
- Unlabeled data + labeled data
Summary
Appendix
cvpaper.challenge 41
n InfoMax principle [Barber+, 2003]
➤ A good representation f_θ(x) = z of the data x is obtained at θ = θ_InfoMax
➤ Maximize the mutual information between x and z (preserve as much information about x as possible),
under the constraint that the entropy of the marginal distribution of z is bounded above
➤ The mutual information can be written as follows
Mutual information maximization
θ_InfoMax = argmax_{θ : H(z) ≤ const} I(x, z)
I(·,·): mutual information, H(·): entropy
I(x, z) = H(z) − H(z|x)
cvpaper.challenge 42
n In the case of Noise as Target (NAT)
➤ Since f_θ(x) = z is a deterministic function, H(z|x) is constant
➤ The set of z's is pushed towards a set of uniform samples on the hypersphere
=> this increases H(z)
➤ Because z is constrained to the unit hypersphere in Euclidean space, the entropy of the uniform
distribution on the hypersphere is an upper bound on H(z)
=> the entropy of the marginal distribution of z is bounded above
n In the case of Instance Discrimination (ID)
➤ Instance-level discrimination using NCE
➤ Read alongside the CPC paper, it does almost the same thing as maximizing a lower bound on
the mutual information between the input and the features
- The difference is that the feature extractor and the discriminator share all parameters,
and only the gradient as a discriminator is used for the update
Mutual information maximization
Neither paper discusses the relation to the InfoMax principle in detail,
but it was presumably part of the underlying intuition.
cvpaper.challenge 43
n Deep InfoMax (DIM)
➤ Explicitly maximizes the mutual information between input and features
➤ In the experiments, maximizing it between local image patches and the global image feature worked best
n Contrastive Predictive Coding (CPC)
➤ Explicitly maximizes the mutual information between input and features
➤ Maximizes the mutual information between the sequence observed so far and future parts of the sequence
Mutual information maximization
Unlike the earlier NAT and ID, in both cases what proves effective is maximizing the mutual information
between the partially observed information and the whole (or the missing part itself).

More Related Content

教師なし画像特徴表現学習の動向 {Un, Self} supervised representation learning (CVPR 2018 完全読破チャレンジ報告会)

  • 1. 教師なし 画像特徴表現学習の動向 鈴⽊ 智之 (@tomoyukun) 1 http://hirokatsukataoka.net/project/cc/index_cvpaperchallenge.html CVPR 2018 完全読破チャレンジ報告会 cvpaper.challenge勉強会 @Wantedly⽩⾦台オフィス 配布⽤ {Un, Self} supervised representation learning
  • 2. cvpaper.challenge 2 n 鈴⽊ 智之 (すずき ともゆき) ➤ Twitter : @tomoyukun ➤ 所属:慶応⼤ 修⼠2年 - ⻘⽊研究室 - 産総研RA (2017/5~) - cvpapar.challenge (2017/5~) ➤ 研究の興味 - ⾏動認識,表現学習など ➤ 国際発表論⽂ - Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DB, CVPR 2018. - Learning Spatiotemporal 3D Convolution with Video Order Self-supervision, ECCVWS 2018. - Semantic Change Detection, ICARCV 2018. ⾃⼰紹介
  • 3. cvpaper.challenge 3 n 教師なし特徴表現学習とは? ➤ 定義 ➤ 評価⽅法 ➤ アプローチの⼤別 n 論⽂紹介 ➤ ~ CVPR 2017 ➤ ~ CVPR 2018 ➤ さらに最新の動向 n まとめ n Appendix ➤ 相互情報量の最⼤化 本⽇の内容 研究を羅列していきます
  • 4. cvpaper.challenge 4 n 今回の特徴表現の良さ=discriminative - 解きたいタスク (target task) に有効なデータの特徴表現を 擬似的なタスク (pretext task) を事前に解くことで獲得する - disentangleなど,他の良さについては問わない n Self-supervised - ⾃動で⽣成できる教師信号を⽤いてpretext taskを定義 - 画像,動画,⾔語,マルチモーダル n Self-supervised以外 (Unsupervised) - データ分布を表現するモデルを学習する (教師はない) 教師なし特徴表現学習とは? 教師がないデータを⽤いてそれらの良い特徴 表現を獲得すること(そのまま) CNNと画像のみを⽤いた教師なし特徴表現学習について
  • 5. cvpaper.challenge 5 n 評価⽅法① : 特徴抽出+識別器 ➤ Pretext taskで学習したモデルを重み固定の特徴抽出器として⽤い, 特徴量のTarget task での性能を測る ➤ 同じデータセット内で評価することが多い - Pretext : ラベルなしImageNet => Target : ラベルありImageNet ➤ AlexNetで評価するのがスタンダード (になってしまっている) どうやって良い特徴表現かを評価する? モデル Pretext task ex. ImageNet w/o labels ex. AlexNet モデル Target task 識 別 器 固定学習 学習 (ex. ImageNet classification) + 画像 データ ラベル 画像 データ
  • 6. cvpaper.challenge 6 n 評価⽅法➁ : Fine-tuning ➤ Pretext taskで学習したパラメータを初期値として⽤い,Target task でFine-tuningした時の性能を測る ➤ 異なるデータセット間で評価を⾏うことが多い - Pretext : ラベルなしImageNet => Target : ラベルありPascal VOC ➤ AlexNetで評価するのがスタンダードなのは評価⽅法①と同様 どうやって良い特徴表現かを評価する? モデル Pretext task ex. ImageNet w/o labels ex. AlexNet Target task 学習 + 画像 データ ラベル モデル 学習 画像 データ 今回はラベルなしImageNet => Pascal VOC*を基準 (ex. Pascal VOC segmentation) * classification : %mAP, detection : %mAP, segmentation : %mIoU
  • 7. cvpaper.challenge 7 Pretext taskの⼤別 Context prediction 識別系 再構成系 ⽣成モデル系 その他 Spot Artifact Colorization Split-brain VAE系 GAN系 Instance Discrimination Jigsaw Jigsaw++ Rotation Counting n CVPR2018までの研究を⼤別 ([Noroozi +, ICCV17]を参考) n 便宜上の分類であることに注意 ➤ アイデアベースの⼿法が多いこともあり,分類が難しい Autoencoder系 Context Encoder Noise as target Exemplar CNN
  • 8. cvpaper.challenge 8 Pretext taskの⼤別 Context prediction 識別系 再構成系 ⽣成モデル系 Spot Artifact Colorization Split-brain VAE系 GAN系Jigsaw Jigsaw++ Rotation n 識別系 ➤ 教師なしデータ𝑥に対応する,⾃動で得られるカテゴリ𝑡を定義 - 教師ありデータ(𝑥, 𝑡)となる - 𝑥に施された何らかの処理𝜙(⋅)に応じて𝑡を定義する場合が多い - その場合は教師ありデータ(𝜙(𝑥), 𝑡) Autoencoder系 Context Encoder その他 Counting Instance Discrimination Noise as target Exemplar CNN
  • 9. cvpaper.challenge 9 Pretext taskの⼤別 Context prediction 識別系 再構成系 ⽣成モデル系 Spot Artifact Colorization Split-brain VAE系 GAN系Jigsaw Jigsaw++ Rotation n 再構成系 ➤ 𝑥 = {𝑥*, 𝑥+}の⼀部を観測できている状態で𝑥または𝑥+を推定 - 全て観測できている場合がAuto encoder - 回帰学習や条件付き⽣成モデル的アプローチがある Autoencoder系 Context Encoder その他 Counting Instance Discrimination Noise as target Exemplar CNN
  • 10. cvpaper.challenge 10 Pretext taskの⼤別 n ⽣成モデル系 ➤ データ分布𝑝(𝑥)を学習することに付随して表現を獲得 - VAEは潜在変数,GANはdiscriminatorの中間特徴など - (個⼈的には) うまく学習できれば⼀番良い表現を獲得できそう - しかし, 𝑝(𝑥)の学習が難しい (下界の最⼤化,ミニマックス問題) Context prediction 識別系 再構成系 ⽣成モデル系 Spot Artifact Colorization Split-brain VAE系 GAN系Jigsaw Jigsaw++ Rotation Autoencoder系 Context Encoder 基本的にself-supervisedと⾔われない その他 Counting Instance Discrimination Noise as target Exemplar CNN
  • 12. cvpaper.challenge 12 Pretext taskの⼤別 Context prediction 識別系 再構成系 ⽣成モデル系 その他 Colorization Split-brain VAE系 GAN系Jigsaw AutoEncoder系 Context Encoder Exemplar CNN n CVPR2018までの研究を⼤別 ([Noroozi +, ICCV17]を参考) n 便宜上の分類であることに注意 ➤ アイデアベースの⼿法が多いこともあり,分類が難しい
  • 13. cvpaper.challenge 13 n Exemplar CNN ➤ Pretext task : (幾何学・⾊)変換に頑健なインスタンスレベルの画像識別 ➤ (クラス数=学習画像インスタンス数)であり,普通にSoftmaxで識別していく ので使⽤できるデータセットの規模がスケールしにくい ➤ 実はInstance Discrimination(後述)と近いこと(2014年時点で)をしている ➤ Geometric matchingなどのtaskでSIFTよりも良い結果 (いきなり)その他 Dosovitskiy et al., “Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks”, NIPS 2014. (~ CVPR2017) 2 Fig. 2. Several random transformations applied to one of the patches extracted from the STL unlabeled dataset. The original (’seed’) patch is in the top left corner. - , the purpose of object classification, we used transformations from the following list: 様々な変換後の,ある画像インスタンス. これを⼀つのクラスと定義. 5 TABLE 1 Classification accuracies on several datasets (in percent). ⇤ Average per-class accuracy1 78.0% ± 0.4%. † Average per-class accuracy 85.0% ± 0.7%. ‡ Average per-class accuracy 85.8% ± 0.7%. Algorithm STL-10 CIFAR-10(400) CIFAR-10 Caltech-101 Caltech-256(30) #features Convolutional K-means Network [32] 60.1 ± 1 70.7 ± 0.7 82.0 — — 8000 Multi-way local pooling [33] — — — 77.3 ± 0.6 41.7 1024 ⇥ 64 Slowness on videos [14] 61.0 — — 74.6 — 556 Hierarchical Matching Pursuit (HMP) [34] 64.5 ± 1 — — — — 1000 Multipath HMP [35] — — — 82.5 ± 0.5 50.7 5000 View-Invariant K-means [16] 63.7 72.6 ± 0.7 81.9 — — 6400 Exemplar-CNN (64c5-64c5-128f) 67.1 ± 0.2 69.7 ± 0.3 76.5 79.8 ± 0.5⇤ 42.4 ± 0.3 256 Exemplar-CNN (64c5-128c5-256c5-512f) 72.8 ± 0.4 75.4 ± 0.2 82.2 86.1 ± 0.5† 51.2 ± 0.2 960 Exemplar-CNN (92c5-256c5-512c5-1024f) 74.2 ± 0.4 76.6 ± 0.2 84.3 87.1 ± 0.7‡ 53.6 ± 0.2 1884 Supervised state of the art 70.1[36] — 92.0 [37] 91.44 [38] 70.6 [2] — 4.3 Detailed Analysis We performed additional experiments using the 64c5-64c5- 128f network to study the effect of various design choices in Exemplar-CNN training and validate the invariance proper- ties of the learned features. 4.3.1 Number of Surrogate Classes We varied the number N of surrogate classes between 50 and 32000. As a sanity check, we also tried classification with random filters. The results are shown in Fig. 3. Clearly, the classification accuracy increases with the number of surrogate classes until it reaches an optimum at about 8000 surrogate classes after which it did not change or even decreased. This is to be expected: the larger the number of surrogate classes, the more likely it is to draw very similar or even identical samples, which are hard or impossible to discriminate. Few such cases are not detrimental to the 50 100 250 500 1000 2000 4000 8000 1600032000 54 56 58 60 62 64 66 68 Number of classes (log scale) ClassificationaccuracyonSTL−10 Classification on STL (± σ) Validation error on surrogate data 0 20 40 60 80 100 Erroronvalidationdata Fig. 3. Influence of the number of surrogate training classes. The val- idation error on the surrogate data is shown in red. Note the different y-axes for the two curves. クラス数(= 画像インスタンス数) が8000あたりで限界となる
  • 14. cvpaper.challenge 14 n Context Prediction (CP) ➤ Pretext task : 画像を3×3に分割し,⼆つのパッチの相対位置の8クラス分類 - 重みを共有した枝構造を持つSiameseNetに2つのパッチを⼊⼒ - 枝のCNNを学習済みモデルとして使⽤ ➤ Fine-tuningの結果はランダム初期化より少し良い程度 cover clusters of, say, foliage. A few subsequent works have attempted to use representations more closely tied to shape [36, 43], but relied on contour extraction, which is difficult in complex images. Many other approaches [22, 29, 16] focus on defining similarity metrics which can be used in more standard clustering algorithms; [45], for instance, re-casts the problem as frequent itemset mining. Geom- etry may also be used to for verifying links between im- ages [44, 6, 23], although this can fail for deformable ob- jects. Video can provide another cue for representation learn- ing. For most scenes, the identity of objects remains un- changed even as appearance changes with time. This kind of temporal coherence has a long history in visual learning literature [18, 59], and contemporaneous work shows strong improvements on modern detection datasets [57]. Finally, our work is related to a line of research on dis- criminative patch mining [13, 50, 28, 37, 52, 11], which has emphasized weak supervision as a means of object discov- ery. Like the current work, they emphasize the utility of learning representations of patches (i.e. object parts) before learning full objects and scenes, and argue that scene-level labels can serve as a pretext task. For example, [13] trains detectors to be sensitive to different geographic locales, but the actual goal is to discover specific elements of architec- tural style. 3. Learning Visual Context Prediction Patch 2Patch 1 pool1 (3x3,96,2)pool1 (3x3,96,2) LRN1LRN1 pool2 (3x3,384,2)pool2 (3x3,384,2) LRN2LRN2 fc6 (4096)fc6 (4096) conv5 (3x3,256,1)conv5 (3x3,256,1) conv4 (3x3,384,1)conv4 (3x3,384,1) conv3 (3x3,384,1)conv3 (3x3,384,1) conv2 (5x5,384,2)conv2 (5x5,384,2) conv1 (11x11,96,4)conv1 (11x11,96,4) fc7 (4096) fc8 (4096) fc9 (8) pool5 (3x3,256,2)pool5 (3x3,256,2) Figure 3. Our architecture for pair classification. Dotted lines in- dicate shared weights. ‘conv’ stands for a convolution layer, ‘fc’ stands for a fully-connected one, ‘pool’ is a max-pooling layer, and ‘LRN’ is a local response normalization layer. Numbers in paren- theses are kernel size, number of outputs, and stride (fc layers have only a number of outputs). The LRN parameters follow [32]. All conv and fc layers are followed by ReLU nonlinearities, except fc9 which feeds into a softmax classifier. semantic reasoning for each patch separately. When design- ing the network, we followed AlexNet where possible. To obtain training examples given an image, we sample the first patch uniformly, without any reference to image SiameseNet Cls. Det. Seg. random 53.3 43.4 19.8 CP 55.3 46.6 — pe- We pre- ing pro- red ject gly, pite on a ion . s as An ner- be uses em. in- with as 321 54 876 ); Y = 3,X = ( Figure 2. The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled. model (e.g. a deep network) to predict, from a single word, the n preceding and n succeeding words. In principle, sim- ilar reasoning could be applied in the image domain, a kind of visual “fill in the blank” task, but, again, one runs into the Fine-tuning on Pascal VOC 識別系 Doersch et al., “Unsupervised visual representation learning by context prediction”, ICCV 2015. (~ CVPR2017)
  • 15. cvpaper.challenge 15 n Jigsaw Puzzle (JP) ➤ Pretext task : パッチをランダムな順に⼊⼒し,正しい順列をクラス識別 - SiameseNetに9つのパッチを同時に⼊⼒ - 順列は膨⼤な数になるのでハミング距離が⼤きくなるように選んだ 1000クラスで学習 ➤ CPはパッチによってはかなりあいまい性がある(下図) ➤ ネットワークが⾒れるパッチが多い⽅があいまい性が減る ➤ CPと⽐較するとかなり精度が改善している 識別系 Cls. Det. Seg. random 53.3 43.4 19.8 CP 55.3 46.6 — JP 67.7 53.2 — ①や②の⑤を基準とした 相対位置を推定するのはかなり難しい P. Favaro (b) (c) representations by solving Jigsaw puzzles. (a) The image marked with green lines) are extracted. (b) A puzzle ob- ① ➁ ⑤ Noroozi et al., “Unsupervised learning of visual representations by solving jigsaw puzzles ”, ECCV 2016. (~ CVPR2017)
  • 16. cvpaper.challenge 16 n ⾼次な情報を必要としないPretext taskの解法 ➤ しかし,実際に捉えてほしいのは⾼次(semantic)な情報 ➤ パッチ境界の低レベルな情報のみで 相対位置の推定が可能? - パッチ間にgapをつける - パッチ位置をjittering ➤ ⾊収差によって相対位置の推定が可能? - ランダムに2チャネルをGaussian noise に置き換え trivial solution occur in a specific spatial configuration (if there is no spe- cific configuration of the parts, then it is “stuff” [1]). We present a ConvNet-based approach to learn a visual repre- sentation from this task. We demonstrate that the resulting visual representation is good for both object detection, pro- viding a significant boost on PASCAL VOC 2007 compared to learning from scratch, as well as for unsupervised object discovery / visual data mining. This means, surprisingly, that our representation generalizes across images, despite being trained using an objective function that operates on a single image at a time. That is, instance-level supervision appears to improve performance on category-level tasks. 2. Related Work One way to think of a good image representation is as the latent variables of an appropriate generative model. An ideal generative model of natural images would both gener- ate images according to their natural distribution, and be concise in the sense that it would seek common causes for different images and share information between them. However, inferring the latent structure given an image is in- tractable for even relatively simple models. To deal with these computational issues, a number of works, such as the wake-sleep algorithm [25], contrastive divergence [24], deep Boltzmann machines [48], and variational Bayesian methods [30, 46] use sampling to perform approximate in- ference. Generative models have shown promising per- formance on smaller datasets such as handwritten dig- its [25, 24, 48, 30, 46], but none have proven effective for 321 54 876 Figure 2. The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled. model (e.g. a deep network) to predict, from a single word, the n preceding and n succeeding words. In principle, sim- ilar reasoning could be applied in the image domain, a kind of visual “fill in the blank” task, but, again, one runs into the problem of determining whether the predictions themselves are correct [12], unless one cares about predicting only very low-level features [14, 33, 53]. To address this, [39] predicts the appearance of an image region by consensus voting of the transitive nearest neighbors of its surrounding regions. Our previous work [12] explicitly formulates a statistical t area has been apertured on a 96x96 size Figure 4. On the left is an example of the famous 例えば… ⾊収差の例 学習時にチャネル間の「収差」を 得られなくする 境界やその外挿で判断できなく する
  • 17. cvpaper.challenge 17 n Context Encoder (CE) ➤ Pretext task : ⽋損画像の補完 - Adversarial Loss + L2 Lossを提案しているが,表現学習の実験は L2 Lossのみ - つまりただの回帰 ➤ ネットワークは表現学習の段階で⽋損画像しか⾒ていない - しかしTarget taskでは⽋損していない画像を⼊⼒する 再構成系 Cls. Det. Seg. random 53.3 43.4 19.8 CE 56.5 44.5 29.7 JP 67.7 53.2 — - t - - - e - r s , y Figure 2: Context Encoder. The context image is passed through the encoder to obtain features which are connected to the decoder using channel-wise fully-connected layer as Pathak et al., “Context encoders: Feature learning by inpainting ”, CVPR 2016. (~ CVPR2017)
  • 18. cvpaper.challenge 18 n Colorful Image Colorization (CC) ➤ Pretext task : グレースケール画像の⾊付け {L => ab} ➤ 単純な回帰ではなく,量⼦化したab空間の識別問題を解く ➤ グレースケール画像⼊⼒を前提として表現学習するため,カラー画像 を扱う場合は,Lab⼊⼒とし,abチャネルはランダムに初期化 n Split-Brain (SB) ➤ ネットワークをチャネル⽅向に2分割し, {L => ab, ab => L} のアンサンブル ➤ 回帰ではなく量⼦化して識別問題に する⽅が良い特徴表現が得られた 再構成系 4 Zhang, Isola, Efros Fig. 2. Our network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm [30] layer. The net has no pool layers. All changes in resolution are achieved through spatial downsampling or upsampling between conv blocks. [29]. In Section 3.1, we provide quantitative comparisons to Larsson et al., and encourage interested readers to investigate both concurrent papers. 2 Approach We train a CNN to map from a grayscale input to a distribution over quantized color value outputs using the architecture shown in Figure 2. Architectural de- tails are described in the supplementary materials on our project webpage1 , and the model is publicly available. In the following, we focus on the design of the objective function, and our technique for inferring point estimates of color from Cls. Det. Seg. random 53.3 43.4 19.8 CC 65.9 46.9 35.6 SB 67.1 46.7 36.0 JP 67.7 53.2 — Input Image X Predicted Image X" L Grayscale Channel X# ab Color Channels X$ Predicted Grayscale Channel X# % Predicted Color Channels X$ % (a) Lab Images Figure 2: Split-Brain Autoencoders applied to various dom Zhang et al., “Colorful Image Colorization”, ECCV 2016. Zhang et al., “Split-brain autoencoders: Unsupervised learning by cross-channel prediction”, CVPR 2017. (~ CVPR2017)
  • 19. cvpaper.challenge 19 n DCGAN ➤ Pretext task : 画像⽣成モデルの学習 - 質の⾼い⽣成を可能とするテクニックを主にアーキテクチャの観点 から提案 - データ分布を⾼い性能でモデル化 => 良い特徴を捉えている ➤ Discriminatorの中間出⼒を表現に利⽤ ➤ ImageNet => Pascal VOCでの実験はなし ➤ CIFAR-10においてExemplar CNNと⽐較 ⽣成モデル系 on CIFAR-10 acc. (%) Num of feature Ex CNN 84.3 1024 DCGAN 82.8 512 Radford et al., “UNSUPERVISED REPRESENTATION LEARNING WITH DEEP CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORKS”, ICLR 2016. (~ CVPR2017) Under review as a conference paper at ICLR 2016 アーキテクチャや表現学習に 使⽤しているデータセットが 異なるため対等な評価とは⾔えない
  • 21. cvpaper.challenge 21 Pretext taskの⼤別 Context prediction 識別系 再構成系 ⽣成モデル系 その他 Spot Artifact Colorization Split-brain VAE系 GAN系Jigsaw Jigsaw++ Rotation CountingAutoEncoder系 Context Encoder Instance Discrimination Noise as target Exemplar CNN n CVPR2018までの研究を⼤別 ([Noroozi +, ICCV17]を参考) n 便宜上の分類であることに注意 ➤ アイデアベースの⼿法が多いこともあり,分類が難しい
  • 22. cvpaper.challenge 22 n BiGAN ➤ 通常の𝑝(𝑥|𝑧)のみをモデル化(Generator)するGANと異なり,潜在変数の 推論𝑝(𝑧|𝑥)もモデル化 (Encoder) ➤ Generatorによる同時分布(𝑝0 𝑥, 𝑧 = 𝑝0 𝑥|𝑧 𝑝(𝑧))とEncoderによる同時分布 (𝑝1 𝑥, 𝑧 = 𝑝1 𝑧|𝑥 𝑝(𝑥))を通常のGANと同様の枠組みで近づける ➤ 特徴表現としてDの中間出⼒を使⽤する通常のGANよりも良好な結果 - Dはデータ分布とそれ以外を汎⽤的に識別するものではない ⽣成モデル系 Cls. Det. Seg. random 53.3 43.4 19.8 BiGAN 60.3 46.9 35.2 JP 67.7 53.2 — Donahue et al., “ADVERSARIAL FEATURE LEARNING”, ICLR 2017. (~ CVPR2017) Published as a conference paper at ICLR 2017 features data z G G(z) xEE(x) G(z), z x, E(x) D P(y) Figure 1: The structure of Bidirectional Generative Adversarial Networks (BiGAN). generator maps latent samples to generated data, but the framework does not include an inverse mapping from data to latent representation. Hence, we propose a novel unsupervised feature learning framework, Bidirectional Generative
  • 23. cvpaper.challenge 23 n Learning to Count (LC) ➤ Pretext task : 以下の制約を満たす特徴量を学習 ➤ 制約:各分割画像と元画像をそれぞれ同じCNNに⼊⼒し,元画像の出⼒ 特徴が全分割画像の出⼒特徴の和と⼀致する => 出⼒特徴の各次元が画像内の「ある⾼次なprimitive」の量を表す場合に 上記の制約を満たすことができる ➤ 個⼈的にかなり⾯⽩いアイデア その他 Cls. Det. Seg. random 53.3 43.4 19.8 LC 67.7 51.4 36.6 JP 67.7 53.2 — neurons 0 100 200 300 400 500 600 700 800 900 1000 averagemagnitude 0 0.2 0.4 0.6 0.8 1 Figure 3: Average response of our trained network on the ImageNet validation set. Despite its sparsity (30 non zero entries), the hidden representation in the trained net- work performs well when transferred to the classification, detection and segmentation tasks. Method Ref Class. Det. Supervised [20] [43] 79.9 56.8 Random [33] 53.3 43.4 Context [9] [19] 55.3 46.6 Context [9]∗ [19] 65.3 51.1 Jigsaw [30] [30] 67.6 53.2 ego-motion [1] [1] 52.9 41.8 ego-motion [1]∗ [1] 54.2 43.9 Adversarial [10]∗ [10] 58.6 46.2 ContextEncoder [33] [33] 56.5 44.5 Sound [31] [44] 54.4 44.0 Sound [31]∗ [44] 61.3 - Video [41] [19] 62.8 47.4 Video [41]∗ [19] 63.1 47.2 Colorization [43]∗ [43] 65.9 46.9 Split-Brain [44]∗ [44] 67.1 46.7 ColorProxy [22] [22] 65.9 - 特徴量がprimitiveのヒストグラムのようなものになる Noroozi et al., “Representation Learning by Learning to Count”, ICCV 2017. 同じ⼈ (~ CVPR2018)
  • 24. cvpaper.challenge 24 n Noise as target (NAT) ➤ Pretext task : ⼀様にサンプリングされたtarget vectorsに各画像からの出⼒ を1対1に対応させ,近づける - Targetは全体サンプルの誤差の和が最⼩になるように割り当てたい - 全⾛査は厳しいのでバッチごとにハンガリアン法で近似的に割り当て ➤ ⼀⾒意味不明だが,画像の特徴ベクトルを特徴空間上に⼀様に分散させる ことに意味があるらしい (Appendix参照) その他 Cls. Det. Seg. random 53.3 43.4 19.8 NAT 65.3 49.4 36.6 JP 67.7 53.2 — Bojanowski et al., “Unsupervised Learning by Predicting Noise”, ICML 2017. Unsupervised Learning by Predicting Noise Target space Features AssignmentImages cj Pf(X) CNN Figure 1. Our approach takes a set of images, computes their deep Choosing the loss function. In the supervised setting, a popular choice for the loss ` is the softmax function. How ever, computing this loss is linear in the number of targets making it impractical for large output spaces (Goodman 2001). While there are workarounds to scale these losses to large output spaces, Tygert et al. (2017) has recently shown that using a squared `2 distance works well in many su pervised settings, as long as the final activations are uni normalized. This loss only requires access to a single tar get per sample, making its computation independent of the number of targets. This leads to the following problem: min ✓ min Y 2Rn⇥d 1 2n kf✓(X) Y k2 F , (2 where we still denote by f✓(X) the unit normalized fea tures. データ数分,⼀様分布から サンプリング(固定) Unsupervised Learning by Predicting Noise Figure 3. Images and their 3 nearest neighbors in ImageNet according to our model using an `2 distance. The query images are shown on the top row, and the nearest neighbors are sorted from the closer to the further. Our features seem to capture global distinctive structures. Figure 4. Filters form the first layer of an AlexNet trained on Im- ageNet with supervision (left) or with NAT (right). The filters are in grayscale, since we use grayscale gradient images as input. This visualization shows the composition of the gradients with the the bird. 4.2. Comparison with the state of the art We report results on the transfer task both on ImageNet and PASCAL VOC 2007. In both cases, the model is trained on ImageNet. ImageNet classification. In this experiment, we evaluate the quality of our features for the object classification task of ImageNet. Note that in this setup, we build the unsuper- vised features on images that correspond to predefined im- age categories. Even though we do not have access to cat- egory labels, the data itself is biased towards these classes. In order to evaluate the features, we freeze the layers up to the last convolutional layer and train the classifier with supervision. This experimental setting follows Noroozi & Favaro (2016). Nearest Neighbor (~ CVPR2018)
  • 25. cvpaper.challenge 25
n Instance Discrimination (ID)
➤ Pretext task: a classification problem in which every image instance is its own class
- Since the number of classes is huge, NCE is used in practice
- Minimize the cross entropy whose logits are the inner products between the input image's feature and each image's feature from the previous iteration (memory bank; a minimal sketch follows this slide)
➤ At the optimum, the embedding should scatter each image's feature sparsely over the hypersphere (see Appendix)
=> so it should be doing something quite close to NAT (which is not cited)
Other
Cls. / Det. / Seg.: random 53.3 / 43.4 / 19.8; ID — / 48.1 / —; JP 67.7 / 53.2 / —
Wu et al., "Unsupervised Feature Learning via Non-Parametric Instance Discrimination", CVPR 2018.
[Figure from the paper: CNN backbone -> 2048-D -> 128-D L2-normalized embedding -> non-parametric softmax over a memory bank holding one feature per image from the previous iteration]
(~ CVPR2018) CVPR2018
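A minimal sketch of the non-parametric softmax with a memory bank, assuming PyTorch; for readability it uses a full softmax over all instances rather than the NCE approximation in the paper, and the momentum update of the bank is a simplification of the paper's update rule:

```python
import torch
import torch.nn.functional as F

def instance_discrimination_step(net, images, idx, bank, tau=0.07, momentum=0.5):
    """idx: dataset indices of the batch; bank: (N, d) memory of normalized features."""
    v = F.normalize(net(images), dim=1)          # (B, d) current features
    logits = v @ bank.t() / tau                  # similarity to every stored instance
    loss = F.cross_entropy(logits, idx)          # the "class" of image i is i itself
    with torch.no_grad():                        # slowly update the stored features
        bank[idx] = F.normalize(momentum * bank[idx] + (1 - momentum) * v, dim=1)
    return loss
```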
  • 26. cvpaper.challenge 26
n Spot Artifacts (SA)
➤ Pretext task: inpaint images that have been corrupted in feature-map space
- Adversarial training between the repair layers that fill in the dropped entries and a discriminator
- Uses the feature maps of a model pretrained as an autoencoder
- The discriminator is expected to end up with a good feature representation
➤ Corrupting the feature map (rather than pixels) is expected to remove higher-level information (although it is hard to tell from the corrupted images themselves)
(a rough sketch of the training step follows this slide)
Reconstruction
Cls. / Det. / Seg.: random 53.3 / 43.4 / 19.8; SA 69.8 / 52.5 / 38.1; JP 67.7 / 53.2 / —
[Figure 2 of the paper: two autoencoders output either real images or images with artifacts; the corrupted images are produced by masking the encoded feature and passing it through a repair network distributed across the decoder; the discriminator is trained to tell real from corrupt and also to output the drop mask, i.e., to localize the artifacts]
Repair layers are inserted; the corrupted locations are predicted; red: corrupt, green: real
Jenni et al., "Self-Supervised Feature Learning by Learning to Spot Artifacts", CVPR 2018.
(~ CVPR2018) CVPR2018
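A rough sketch of one adversarial step, assuming PyTorch modules `enc`/`dec` from the pretrained autoencoder, a `repair` network that takes the masked feature map and the mask, and a discriminator `disc` returning a real/corrupt logit; the paper's extra head that predicts the drop mask is omitted, and all module interfaces here are assumptions:

```python
import torch
import torch.nn.functional as F

def spot_artifact_step(enc, dec, repair, disc, x, drop_prob=0.5):
    with torch.no_grad():
        feat = enc(x)                                            # (B, C, H, W) pretrained feature map
    keep = (torch.rand_like(feat[:, :1]) > drop_prob).float()   # spatial mask of kept entries
    corrupted = dec(repair(feat * keep, keep))                   # repaired decode (contains artifacts)
    real = dec(feat)                                             # clean decode
    # Discriminator: tell real decodes from corrupted ones.
    d_real, d_fake = disc(real), disc(corrupted.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # Repair network: try to make the corrupted decode look real.
    g_fake = disc(corrupted)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
    return d_loss, g_loss
```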
  • 27. cvpaper.challenge 27
n Jigsaw Puzzle++ (JP++)
➤ Pretext task: the Jigsaw Puzzle, but with 1-3 patches replaced by patches from another image
- Fewer patches of the original image are visible, and the patches from the other image have to be told apart
- Both of the above make the pretext task harder
- Permutations are chosen with Hamming distance in mind so that no puzzle can belong to more than one class (a selection sketch follows this slide)
Discriminative
Cls. / Det. / Seg.: random 53.3 / 43.4 / 19.8; LC 67.7 / 51.4 / 36.6; JP++ 69.8 / 55.5 / 38.1; JP 67.7 / 53.2 / —
[Figure 3 of the paper: the Jigsaw++ task: (a) the main image, (b) a random image, (c) a puzzle from the original formulation, (d) ...]
Noroozi et al., "Boosting Self-Supervised Learning via Knowledge Transfer", CVPR 2018. Same author (~ CVPR2018) CVPR2018
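A minimal sketch of how the permutation set can be chosen by greedily maximizing the minimum Hamming distance between permutations (the recipe from the original Jigsaw paper, which Jigsaw++ reuses); 9 tiles and 100 classes are illustrative defaults:

```python
import itertools
import numpy as np

def select_permutations(n_classes=100, n_tiles=9, seed=0):
    rng = np.random.default_rng(seed)
    all_perms = np.array(list(itertools.permutations(range(n_tiles))))  # 9! = 362,880 rows
    min_dist = np.full(len(all_perms), n_tiles)   # distance to the closest chosen permutation
    idx = int(rng.integers(len(all_perms)))       # start from a random permutation
    chosen = []
    for _ in range(n_classes):
        chosen.append(all_perms[idx])
        min_dist = np.minimum(min_dist, (all_perms != all_perms[idx]).sum(axis=1))
        idx = int(min_dist.argmax())              # next: farthest from everything chosen so far
    return np.stack(chosen)                       # (n_classes, n_tiles) permutation table
```

Keeping the chosen permutations far apart in Hamming distance is what prevents two visually similar tile orderings from mapping to different classes.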
  • 28. cvpaper.challenge 28
n Classify Rotation (CR)
➤ Pretext task: predict the rotation applied to an image
- 4-class classification over 0°, 90°, 180°, 270°
- Finer-grained rotations would require interpolation after rotating => the resulting artifacts would become a shortcut and cause a trivial solution
➤ Estimating an object's rotation angle requires high-level information about the object
➤ Best accuracy so far (Cls., Det.) & the easiest to implement (a minimal sketch follows this slide)
Discriminative
Cls. / Det. / Seg.: random 53.3 / 43.4 / 19.8; CR 73.0 / 54.4 / 39.1; JP++ 69.8 / 55.5 / 38.1
[Figure 2 of the paper: the same image rotated by 0/90/180/270 degrees is fed to one ConvNet F(.), which is trained to predict which of the four rotations was applied]
Gidaris et al., "Unsupervised Representation Learning by Predicting Image Rotations", ICLR 2018.
(~ CVPR2018)
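A minimal sketch of the rotation pretext loss, assuming PyTorch and a `net` ending in a 4-way classifier; rotations by multiples of 90 degrees need no interpolation, so `torch.rot90` is enough:

```python
import torch
import torch.nn.functional as F

def rotation_loss(net, images):
    B = images.size(0)
    # Stack the four rotated copies: k * 90 degrees for k = 0, 1, 2, 3.
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    labels = torch.arange(4, device=images.device).repeat_interleave(B)
    return F.cross_entropy(net(rotated), labels)
```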
  • 29. cvpaper.challenge 29
n Classify Rotation (CR)
➤ Dependence on the structure of the data
➤ Depending on the image domain, couldn't the rotation be predicted from low-level features?
- Indeed, the method underperforms on the Places scene classification task
➤ There should also be images for which a rotation is not even well defined
- e.g., aerial photographs
Discriminative
Gidaris et al., "Unsupervised Representation Learning by Predicting Image Rotations", ICLR 2018.
[Table 6 of the paper: Places top-1 classification with linear classifiers trained on frozen feature maps; all unsupervised methods are pretrained on ImageNet without labels; columns are Conv1-Conv5]
Places labels: 22.1 35.1 40.2 43.3 44.6
ImageNet labels: 22.7 34.8 38.4 39.4 38.7
Random: 15.7 20.3 19.8 19.1 17.5
Random rescaled: 21.4 26.2 27.1 26.1 24.0
Context: 19.7 26.7 31.9 32.7 30.9
Context Encoders: 18.2 23.2 23.4 21.9 18.4
Colorization: 16.0 25.7 29.6 30.3 29.7
Jigsaw Puzzles: 23.0 31.9 35.0 34.2 29.3
BiGAN: 22.0 28.7 31.8 31.3 29.7
Split-Brain: 21.3 30.7 34.0 34.1 32.5
Counting: 23.3 33.9 36.3 34.7 29.6
RotNet (ours): 21.5 31.0 35.1 34.6 33.7
[Example Places scene images: indoor / nature / urban scenes such as elevator door, arch, corral, windmill, cafeteria, field road, watering hole, train station platform, tower, swimming pool, rainforest, ...]
Places: for example, the rotation could be predicted from the position of the sky alone (~ CVPR2018)
  • 30. cvpaper.challenge
Comparison: {Self, Un}-supervised learning on ImageNet => Fine-tuning on Pascal VOC 2007
Method | Conference | Classification (%mAP) | Detection (%mAP) | Segmentation (%mIoU)
Random init. | — | 53.3 | 43.4 | 19.8
Context prediction | ICCV15 | 55.3 | 46.6 | —
Context encoder | CVPR16 | 56.5 | 44.5 | 29.7
Colorize | ECCV16 | 65.9 | 46.9 | 35.6
Jigsaw | ECCV16 | 67.7 | 53.2 | —
Split-Brain | CVPR17 | 67.1 | 46.7 | 36.0
NAT | ICML17 | 65.3 | 49.4 | 36.6
Counting | ICCV17 | 67.7 | 51.4 | 36.6
BiGAN | ICLR17 | 60.1 | 46.9 | 34.9
Rotation | ICLR18 | 73.0 | 54.4 | 39.1
Spot Artifact | CVPR18 | 69.8 | 52.5 | 38.1
Instance Dis. | CVPR18 | — | 48.1 | —
Jigsaw++ | CVPR18 | 69.8 | 55.5 | 38.1
Supervised | — | 79.9 | 59.1 | 48.0
  • 31. cvpaper.challenge
Comparison (continued): the same table as the previous slide, with Deep Cluster added
{Self, Un}-supervised learning on ImageNet => Fine-tuning on Pascal VOC 2007
Deep Cluster | ECCV18 | 73.7 | 55.4 | 45.1
Supervised | — | 79.9 | 59.1 | 48.0
  • 33. cvpaper.challenge 33
n Deep Cluster (DC)
➤ Repeat the following two steps:
① k-means clustering on the CNN's intermediate features
② learn a classification problem using the assigned clusters as pseudo labels
➤ In the first iteration, clustering is performed on the outputs of a randomly initialized CNN
- Even training an MLP on those outputs reaches 12% => the input information is preserved to some extent
➤ In the ImageNet experiments, k = 10000 (> 1000) works best
➤ A simple and very powerful method (a minimal sketch of one epoch follows this slide)
Recent trends
Caron et al., "Deep Clustering for Unsupervised Learning of Visual Features", ECCV 2018.
Cls. / Det. / Seg.: random 53.3 / 43.4 / 19.8; CR 73.0 / 54.4 / 39.1; JP++ 69.8 / 55.5 / 38.1; DC 73.7 / 55.4 / 45.1
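A minimal sketch of one DeepCluster epoch, assuming PyTorch and scikit-learn; the paper clusters with faiss for speed and adds PCA-whitening and uniform sampling over clusters, all of which are omitted here, and `classifier` is assumed to be a fresh linear head re-initialized each epoch:

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def deepcluster_epoch(net, classifier, loader, optimizer, k=10000, device="cuda"):
    # 1) Compute features for the whole dataset and cluster them with k-means.
    net.eval()
    feats, ids = [], []
    with torch.no_grad():
        for x, idx in loader:                        # loader yields (image, dataset index)
            feats.append(net(x.to(device)).cpu())
            ids.append(idx)
    feats, ids = torch.cat(feats), torch.cat(ids)
    labels = KMeans(n_clusters=k).fit_predict(feats.numpy())
    pseudo_labels = torch.empty(len(loader.dataset), dtype=torch.long)
    pseudo_labels[ids] = torch.as_tensor(labels, dtype=torch.long)   # index by dataset index
    # 2) Train the CNN (plus the fresh classifier head) on its own cluster assignments.
    net.train()
    for x, idx in loader:
        loss = F.cross_entropy(classifier(net(x.to(device))), pseudo_labels[idx].to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```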
  • 34. cvpaper.challenge 34
n Deep Cluster (DC), analysis
Recent trends
Caron et al., "Deep Clustering for Unsupervised Learning of Visual Features", ECCV 2018.
[Figure from the paper: (a) clustering quality, (b) cluster reassignment, (c) influence of k]
The mutual information between the clusters and the ImageNet labels increases over training; the mutual information between the assignments of consecutive epochs also increases => the cluster assignments become stable
Cls. / Det. / Seg.: random 53.3 / 43.4 / 19.8; CR 73.0 / 54.4 / 39.1; JP++ 69.8 / 55.5 / 38.1; DC 73.7 / 55.4 / 45.1
  • 35. cvpaper.challenge 35
n Deep InfoMax (DIM)
➤ Learn to maximize the mutual information I(x; z) between the input x and the feature vector z
- Roughly speaking, make x and z strongly dependent
- In practice, maximizing the mutual information between z and each patch of x is what brings the large gain
➤ Simply attaching a discriminator that classifies (x, z) pairs as positive or negative and training end-to-end is enough to maximize a lower bound on I(x; z) (a rough sketch follows this slide)
➤ No GAN-style alternating optimization, so it is easy to implement and train
➤ Not compared against every method, but the accuracy is close to supervised learning
Recent trends
Hjelm et al., "Learning deep representations by mutual information estimation and maximization", arXiv 8/2018.
[Figure 1 of the paper: an image is encoded into an M×M map of local feature vectors (one per input patch), which is then summarized into a single global feature vector Y]
[Figure 2 of the paper: the global-MI objective passes Y together with the M×M feature map through a discriminator; fake samples pair Y with the feature map of another image]
[Table 2 of the paper, top-1 linear-probe accuracy on Tiny ImageNet (conv / fc4096 / Y64) and STL (conv): Fully supervised 36.60 / — / — / —; VAE 18.63 / 16.88 / 11.93 / 58.27; AAE 18.04 / 17.27 / 11.49 / 59.54; BiGAN 24.38 / 20.21 / 13.06 / 71.53; NAT 13.70 / 11.62 / 1.20 / 64.32; DIM(G) 11.32 / 6.34 / 4.95 / 42.03; DIM(L) 33.8 / 34.5 / 30.7 / 71.82]
Accuracy close to supervised on Tiny ImageNet
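A rough sketch of a local-DIM-style objective, assuming PyTorch, a `critic` that scores every (local patch feature, global feature) pair, and a simple binary-cross-entropy formulation in place of the paper's JSD/infoNCE estimators; negatives pair a global vector with the feature map of another image in the batch:

```python
import torch
import torch.nn.functional as F

def local_dim_loss(critic, local_feat, global_feat):
    """local_feat: (B, C, M, M) patch features; global_feat: (B, D) summary vectors."""
    B = local_feat.size(0)
    shuffled = local_feat[torch.randperm(B, device=local_feat.device)]  # mismatched maps
    pos = critic(local_feat, global_feat)        # (B, M, M) scores for true pairs
    neg = critic(shuffled, global_feat)          # scores for fake (mismatched) pairs
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```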
  • 36. cvpaper.challenge 36
n Contrastive Predictive Coding (CPC)
➤ For sequential data, maximize the mutual information between the feature vector c_t at some time step and a future input x_{t+k}
➤ Here the lower bound on the mutual information is maximized by having the discriminator solve an N-class classification problem: pick the single positive pair out of N pairs (a minimal InfoNCE sketch follows this slide)
➤ For images, the feature map is treated as a top-to-bottom sequence, as in the figure
➤ Not compared against every method, but within its own experiments the accuracy is by far the best
Recent trends
Oord et al., "Representation Learning with Contrastive Predictive Coding", arXiv 7/2018.
[Figure 4 of the paper: 2D adaptation of CPC; 64-px crops of a 256-px image with 50% overlap are encoded by g_enc, and the context c_t produced by g_ar predicts the codes z_{t+2}, z_{t+3}, z_{t+4} of the rows below]
[Table 3 of the paper, ImageNet top-1 unsupervised (linear-probe) classification: with AlexNet conv5, Video 29.8, Relative Position 30.4, BiGAN 34.8, Colorization 35.2, Jigsaw 38.1*; with ResNet-V2, Motion Segmentation 27.6, Exemplar 31.5, Relative Position 36.2, Colorization 39.6, CPC 48.7 (*Jigsaw not directly comparable due to architectural differences)]
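A minimal sketch of the InfoNCE loss at the heart of CPC, assuming PyTorch, a batch of context vectors `context` (c_t), the matching future codes `future` (z_{t+k}), and a learned linear map `W_k`; the other items in the batch act as negatives:

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, W_k):
    """context: (B, D) c_t vectors; future: (B, D) z_{t+k} codes;
    W_k: e.g. nn.Linear(D, D, bias=False), one per prediction offset k."""
    pred = W_k(context)                       # predicted future code
    logits = pred @ future.t()                # (B, B): entry (i, j) scores the pair (c_i, z_j)
    labels = torch.arange(logits.size(0), device=logits.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)    # N-way classification = the InfoNCE bound on MI
```

Minimizing this N-way cross entropy is exactly the "pick the one positive pair out of N" problem described above.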
  • 38. cvpaper.challenge 38
n Up to CVPR 2018
➤ A wide variety of idea-driven methods have been published (and many more were probably shelved)
➤ Self-supervised learning that exploits the structure of image data had the upper hand (Rotation, Jigsaw, ...)
n Current movement
➤ Methods that do not depend on the data structure have started to work well (Deep Cluster, approaches based on mutual information)
➤ Methods that do depend on the data structure seem to succeed or fail depending on the image domain (see the Rotation-on-Places results)
n Outlook
➤ In terms of methods
- Structure-agnostic methods will develop further (hard to imagine the specifics)
➤ As a research area
- Beat supervised learning (surpass ImageNet-pretrained models)
- Task-specific unsupervised learning (it already exists, but...); this direction seems to pair well with self-supervised learning that exploits data structure
Summary
  • 39. cvpaper.challenge 39
n What is fun
➤ Being able to learn from data alone, without any annotation, is an exciting prospect
➤ Designing a pretext task while thinking about the data structure also feels like solving a puzzle yourself
n What is painful
➤ You basically cannot tell until you try (whether an idea is good only shows up in the experimental results)
➤ Evaluation requires tuning twice (pretext and target)
n For practical use
➤ The prevailing mood is that, as a pretrained model, an ImageNet-pretrained model is all you need
➤ However, there are cases where an ImageNet-pretrained model is not effective
- when the image domain differs greatly from ImageNet
➤ Under such conditions these methods could find their use
➤ Depending on the setting they also compete with semi-supervised learning
- unlabeled data + labeled data
Summary
  • 41. cvpaper.challenge 41
n InfoMax principle [Barber+, 2003]
➤ A good representation f_θ(x) = z of data x is obtained at θ = θ_InfoMax
➤ Maximize the mutual information between x and z (i.e., retain as much information about x as possible) under the constraint that the entropy of the marginal distribution of z is bounded by a constant
➤ The mutual information can be written as follows
Maximizing mutual information
θ_InfoMax = argmax_{θ : H(z) ≤ const.} I(x; z),  where I(·; ·) is mutual information and H(·) is entropy
I(x; z) = H(z) - H(z|x)
  • 42. cvpaper.challenge 42
n The case of Noise as Targets (NAT)
➤ Since f_θ(x) = z is a deterministic function, H(z|x) is constant
➤ The set of z is pushed toward a set of uniform samples on the hypersphere => this increases H(z)
➤ Because z is constrained to the unit hypersphere in Euclidean space, the entropy of the uniform distribution on the hypersphere is an upper bound on H(z) => the entropy of the marginal distribution of z is bounded by a constant
n The case of Instance Discrimination (ID)
➤ Performs instance-level discrimination using NCE
➤ Read against the CPC paper, it does almost the same thing as maximizing a lower bound on the mutual information between the input and the feature
- the difference is that the feature extractor and the discriminator share all of their parameters, and only the gradients from the discriminator role are used for the update
Maximizing mutual information
Neither paper touches on the detailed relation to the InfoMax principle, but it was presumably at the base of the idea
  • 43. cvpaper.challenge 43
n Deep InfoMax (DIM)
➤ Explicitly maximizes the mutual information between the input and the feature
➤ In the experiments, maximizing it between local image patches and the whole-image feature worked best
n Contrastive Predictive Coding (CPC)
➤ Explicitly maximizes the mutual information between the input and the feature
➤ Maximizes the mutual information between the sequence observed so far and the upcoming part of the sequence
Maximizing mutual information
Unlike the earlier NAT and ID, both methods owe their effectiveness to maximizing the mutual information between partially observed information and the whole (or the missing part)