Tomoyuki Suzuki (@tomoyukun)
CVPR 2018 Complete Read-Through Challenge Report Meeting, cvpaper.challenge Study Group
{Un, Self} supervised representation learning
n Tomoyuki Suzuki
➤ Twitter : @tomoyukun
➤ Affiliation : Keio University, second-year master's student
- Aoki Laboratory
- AIST research assistant (2017/5~)
- cvpaper.challenge (2017/5~)
➤ Research interests
- Action recognition, representation learning, etc.
➤ International publications
- Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DB, CVPR 2018.
- Learning Spatiotemporal 3D Convolution with Video Order Self-supervision, ECCVWS 2018.
- Semantic Change Detection, ICARCV 2018.
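n What is unsupervised feature representation learning?
➤ Definition
➤ Evaluation methods
➤ Broad categories of approaches
n Paper introductions
➤ ~ CVPR 2017
➤ ~ CVPR 2018
➤ Even more recent trends
n Summary
n Appendix
➤ Mutual information maximization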
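n "Good" feature representation in this talk = discriminative
- Acquire a data representation that is effective for the task we want to solve (target task) by first solving a surrogate task (pretext task)
- Other notions of quality, such as disentanglement, are out of scope
n Self-supervised
- Define the pretext task using supervisory signals that can be generated automatically
- Images, video, language, multimodal
n Other than self-supervised (Unsupervised)
- Learn a model that represents the data distribution (no supervision)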
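n Evaluation method ① : feature extraction + classifier
➤ Use the model trained on the pretext task as a feature extractor with fixed weights, and measure how well the features perform on the target task
➤ Evaluation is often done within the same dataset
- Pretext : unlabeled ImageNet => Target : labeled ImageNet
➤ Evaluating with AlexNet is the standard (or rather, has ended up being the standard)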
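[Diagram: a network (ex. AlexNet) is trained on the pretext task (ex. ImageNet w/o labels); its weights are then fixed and a classifier is trained on the target task (ex. ImageNet classification) from data and labels.]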
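A minimal sketch of evaluation method ①, assuming a toy stand-in for the pretext-trained network. `frozen_features`, the data and the hyperparameters are hypothetical; only the classifier on top is trained, and the extractor is never updated.

```python
import math
import random

def frozen_features(x):
    """Hypothetical stand-in for a pretext-trained network with fixed weights."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(data, labels, epochs=200, lr=0.5, seed=0):
    """Train only a logistic-regression classifier on top of the frozen features."""
    rng = random.Random(seed)
    feats = [frozen_features(x) for x in data]  # the extractor itself is not trained
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        order = list(range(len(feats)))
        rng.shuffle(order)
        for i in order:
            z = sum(wj * fj for wj, fj in zip(w, feats[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - labels[i]  # gradient of the logistic loss w.r.t. z
            w = [wj - lr * g * fj for wj, fj in zip(w, feats[i])]
            b -= lr * g
    return w, b

def predict(w, b, x):
    f = frozen_features(x)
    return 1 if sum(wj * fj for wj, fj in zip(w, f)) + b > 0 else 0

data = [(0.0, 1.0), (0.2, 0.9), (1.0, 0.1), (0.9, 0.0)]
labels = [0, 0, 1, 1]
w, b = train_linear_probe(data, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(data, labels)) / len(data)
```

In the fine-tuning setting, by contrast, the extractor weights would also receive gradient updates.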
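n Evaluation method ➁ : Fine-tuning
➤ Use the parameters learned on the pretext task as initial values and fine-tune on the target task
➤ Evaluation is often done across different datasets
- Pretext : unlabeled ImageNet => Target : labeled Pascal VOC
➤ As with evaluation method ①, evaluating with AlexNet is the standard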
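[Diagram: a network (ex. AlexNet) is trained on the pretext task (ex. ImageNet w/o labels); the network is then trained further on the target task (ex. Pascal VOC segmentation) from images and labels.]
This talk uses unlabeled ImageNet => Pascal VOC* as the reference setting
* classification : %mAP, detection : %mAP, segmentation : %mIoU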
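As a reminder of what the segmentation metric measures, here is a minimal sketch of mean IoU over flat label lists. It is a simplification: the actual Pascal VOC protocol (ignored pixels, accumulation over the whole dataset) is not reproduced, and `mean_iou` is a hypothetical helper name.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Class 0: IoU 1/2, class 1: IoU 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```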
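Broad categories of pretext tasks
n Research up to CVPR2018, grouped into broad categories (following [Noroozi +, ICCV17])
n Note that this grouping is only for convenience
➤ Many methods are idea-driven, which makes them hard to categorize
[Diagram: methods arranged under four columns: classification-based, reconstruction-based, generative-model-based, and other. Labels visible in the diagram: Context prediction, Spot Artifact, Context Encoder, Noise as target, Exemplar CNN.]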
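n Classification-based
➤ For unlabeled data 𝑥, define a category 𝑡 that can be obtained automatically
- This yields supervised data (𝑥, 𝑡)
- 𝑡 is often defined according to some processing 𝜙(⋅) applied to 𝑥
- In that case the supervised data is (𝜙(𝑥), 𝑡)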
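n Reconstruction-based
➤ With only part of 𝑥 = {𝑥_1, 𝑥_2} observed, estimate 𝑥 or 𝑥_2
- When everything is observed, this is an autoencoder
- Approaches include regression and conditional generative models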
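n Generative-model-based
➤ A representation is acquired as a by-product of learning the data distribution 𝑝(𝑥)
- e.g. the latent variables of a VAE, or intermediate features of a GAN discriminator
- (Personally) this seems likely to give the best representation if training succeeds
- However, learning 𝑝(𝑥) is hard (lower-bound maximization, minimax problems)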
~ CVPR 2017
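n Exemplar CNN
➤ Pretext task : instance-level image classification that is robust to (geometric and color) transformations
➤ The number of classes equals the number of training image instances, and classification uses an ordinary softmax
➤ In effect it was already (in 2014) doing something close to Instance Discrimination (described later)
➤ Outperforms SIFT on tasks such as geometric matching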
Dosovitskiy et al., “Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks”, NIPS 2014.
(~ CVPR2017)
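A minimal sketch of how such surrogate classes could be built: each seed patch becomes its own class, and randomly transformed copies share that class label. The three transformations below (flip, circular shift, brightness gain) are simplified stand-ins chosen for illustration, not the transformation list used in the paper, and all function names are hypothetical.

```python
import random

def random_transform(patch, rng):
    """Apply a random flip, circular shift and brightness change to a 2D patch."""
    out = [row[:] for row in patch]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]              # horizontal flip
    shift = rng.randrange(len(out[0]))
    out = [row[shift:] + row[:shift] for row in out]  # circular horizontal shift
    gain = rng.uniform(0.8, 1.2)                      # brightness change
    return [[v * gain for v in row] for row in out]

def build_surrogate_dataset(seed_patches, copies_per_seed, seed=0):
    """Return (transformed_patch, class_id) pairs; class_id is the seed patch index."""
    rng = random.Random(seed)
    samples = []
    for class_id, patch in enumerate(seed_patches):
        for _ in range(copies_per_seed):
            samples.append((random_transform(patch, rng), class_id))
    return samples

seeds = [[[0, 1], [2, 3]], [[4, 5], [6, 7]], [[8, 9], [1, 0]]]
dataset = build_surrogate_dataset(seeds, copies_per_seed=5)
```

A softmax classifier with as many outputs as seed patches would then be trained on `dataset`.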
Fig. 2. Several random transformations applied to one of the patches extracted from the STL unlabeled dataset. The original ('seed') patch is in the top left corner.
Classification accuracies on several datasets (in percent). * Average per-class accuracy 78.0% ± 0.4%. † Average per-class accuracy 85.0% ± 0.7%. ‡ Average per-class accuracy 85.8% ± 0.7%.
Algorithm | STL-10 | CIFAR-10(400) | CIFAR-10 | Caltech-101 | Caltech-256(30) | #features
Convolutional K-means Network [32] | 60.1 ± 1 | 70.7 ± 0.7 | 82.0 | — | — | 8000
Multi-way local pooling [33] | — | — | — | 77.3 ± 0.6 | 41.7 | 1024 × 64
Slowness on videos [14] | 61.0 | — | — | 74.6 | — | 556
Hierarchical Matching Pursuit (HMP) [34] | 64.5 ± 1 | — | — | — | — | 1000
Multipath HMP [35] | — | — | — | 82.5 ± 0.5 | 50.7 | 5000
View-Invariant K-means [16] | 63.7 | 72.6 ± 0.7 | 81.9 | — | — | 6400
Exemplar-CNN (64c5-64c5-128f) | 67.1 ± 0.2 | 69.7 ± 0.3 | 76.5 | 79.8 ± 0.5* | 42.4 ± 0.3 | 256
Exemplar-CNN (64c5-128c5-256c5-512f) | 72.8 ± 0.4 | 75.4 ± 0.2 | 82.2 | 86.1 ± 0.5† | 51.2 ± 0.2 | 960
Exemplar-CNN (92c5-256c5-512c5-1024f) | 74.2 ± 0.4 | 76.6 ± 0.2 | 84.3 | 87.1 ± 0.7‡ | 53.6 ± 0.2 | 1884
Supervised state of the art | 70.1 [36] | — | 92.0 [37] | 91.44 [38] | 70.6 [2] | —
4.3 Detailed Analysis
We performed additional experiments using the 64c5-64c5-128f network to study the effect of various design choices in Exemplar-CNN training and validate the invariance properties of the learned features.
4.3.1 Number of Surrogate Classes
We varied the number N of surrogate classes between 50 and 32000. As a sanity check, we also tried classification with random filters. The results are shown in Fig. 3. Clearly, the classification accuracy increases with the number of surrogate classes until it reaches an optimum at about 8000 surrogate classes after which it did not change or even decreased. This is to be expected: the larger the number of surrogate classes, the more likely it is to draw very similar or even identical samples, which are hard or impossible to discriminate.
Fig. 3. Influence of the number of surrogate training classes. The validation error on the surrogate data is shown in red. Note the different y-axes for the two curves.
[Plot: x-axis "Number of classes (log scale)", from 50 to 32000; y-axis labels "on STL (± σ)" and "Validation error on surrogate data".]
Number of classes (= number of image instances)
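n Context Prediction (CP)
➤ Pretext task : split the image into a 3×3 grid and classify the relative position of two patches into 8 classes
- The two patches are fed to a Siamese network whose branches share weights
- The branch CNN is used as the pretrained model
➤ Fine-tuning results are only slightly better than random initialization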
[Architecture diagram: two branches with shared weights, one per patch (Patch 1, Patch 2), each consisting of conv1 (11x11,96,4), pool1 (3x3,96,2), conv2 (5x5,384,2), pool2 (3x3,384,2), conv3 (3x3,384,1), conv4 (3x3,384,1), conv5 (3x3,256,1), pool5 (3x3,256,2), fc6 (4096); the joint layers are fc7 (4096), fc8 (4096), fc9 (8).]
Figure 3. Our architecture for pair classification. Dotted lines indicate shared weights. ‘conv’ stands for a convolution layer, ‘fc’ stands for a fully-connected one, ‘pool’ is a max-pooling layer, and ‘LRN’ is a local response normalization layer. Numbers in parentheses are kernel size, number of outputs, and stride (fc layers have only a number of outputs). The LRN parameters follow [32]. All conv and fc layers are followed by ReLU nonlinearities, except fc9 which feeds into a softmax classifier.
Fine-tuning on Pascal VOC
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
Figure 2. The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled.
Doersch et al., “Unsupervised visual representation learning by context prediction”, ICCV 2015.
(~ CVPR2017)
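A minimal sketch of how a training example for the 8-way relative-position task could be sampled from a 3×3 grid. The image is a plain 2D list, and `sample_patch_pair` and `NEIGHBOUR_OFFSETS` are hypothetical names; the real pipeline works on image tensors and adds further processing.

```python
import random

# (row, col) offsets of the 8 neighbours of the centre cell in a 3x3 grid.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                     (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch, rng):
    """Return (centre_patch, neighbour_patch, label) with label in 0..7."""
    h, w = len(image), len(image[0])
    assert h >= 3 * patch and w >= 3 * patch
    top = rng.randrange(0, h - 3 * patch + 1)    # top-left corner of the 3x3 grid
    left = rng.randrange(0, w - 3 * patch + 1)
    label = rng.randrange(8)                     # which neighbour is chosen
    dr, dc = NEIGHBOUR_OFFSETS[label]

    def crop(r, c):
        return [row[c:c + patch] for row in image[r:r + patch]]

    centre = crop(top + patch, left + patch)
    neighbour = crop(top + (1 + dr) * patch, left + (1 + dc) * patch)
    return centre, neighbour, label

image = [[r * 9 + c for c in range(9)] for r in range(9)]
centre, neighbour, label = sample_patch_pair(image, 3, random.Random(0))
```

The two patches go through the weight-sharing branches, and `label` is the classification target.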
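n Jigsaw Puzzle (JP)
➤ Pretext task : feed the patches in a random order and classify which permutation was applied
- All 9 patches are fed to a Siamese network at once
- The number of permutations is enormous, so a subset with large mutual Hamming distance is chosen
➤ CP is quite ambiguous for some patches (see figure)
➤ Ambiguity decreases when the network can see more patches
➤ Accuracy improves considerably over CP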
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
JP 67.7 53.2 —
Noroozi et al., “Unsupervised learning of visual representations by solving jigsaw puzzles ”, ECCV 2016.
(~ CVPR2017)
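One way to pick a permutation subset with large mutual Hamming distance is a greedy max-min selection over random candidates, sketched below. This is an illustrative assumption about the procedure, not the authors' exact algorithm; `select_permutations` and its parameters are hypothetical.

```python
import itertools
import random

def hamming(p, q):
    """Number of positions at which two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def select_permutations(num_tiles, num_classes, num_candidates=200, seed=0):
    """Greedily pick permutations that maximise the minimum distance to those chosen."""
    rng = random.Random(seed)
    tiles = list(range(num_tiles))
    candidates = [tuple(rng.sample(tiles, num_tiles)) for _ in range(num_candidates)]
    candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    chosen = [candidates.pop(0)]
    while len(chosen) < num_classes:
        best = max(candidates, key=lambda c: min(hamming(c, s) for s in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen

def min_pairwise_hamming(perms):
    return min(hamming(p, q) for p, q in itertools.combinations(perms, 2))

perms = select_permutations(num_tiles=9, num_classes=10)
```

Each selected permutation index then serves as one class of the jigsaw classifier.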
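Trivial solution
n Ways of solving the pretext task that need no high-level information
➤ What we actually want the network to capture, however, is high-level (semantic) information
➤ The task can be solved from low-level information at patch boundaries alone
- Put a gap between patches
- Jitter the patch positions
➤ Can relative position be estimated from chromatic aberration?
- Replace two randomly chosen channels with Gaussian noise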
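The countermeasures above can be sketched as follows. The gap, jitter and noise values are hypothetical parameters, and both helper names are made up for illustration.

```python
import random

def neighbour_origin(top, left, patch, gap, jitter, offset, rng):
    """Top-left corner of a neighbour patch, separated by a gap and randomly jittered."""
    dr, dc = offset
    step = patch + gap                               # the gap hides boundary cues
    r = top + dr * step + rng.randint(-jitter, jitter)
    c = left + dc * step + rng.randint(-jitter, jitter)
    return r, c

def replace_two_channels_with_noise(patch_rgb, rng, sigma=0.1):
    """Keep one random colour channel and fill the other two with Gaussian noise."""
    keep = rng.randrange(3)
    return [[tuple(px[ch] if ch == keep else rng.gauss(0.0, sigma) for ch in range(3))
             for px in row]
            for row in patch_rgb]

rng = random.Random(1)
r, c = neighbour_origin(100, 100, patch=96, gap=48, jitter=7, offset=(0, 1), rng=rng)
grey = [[(0.5, 0.5, 0.5)] * 2 for _ in range(2)]
noisy = replace_two_channels_with_noise(grey, rng)
```

With `jitter` smaller than `gap`, the neighbour patch never touches the centre patch, so boundary continuity cannot be exploited.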
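n Context Encoder (CE)
➤ Pretext task : inpainting images with missing regions
- The paper proposes Adversarial Loss + L2 Loss, but the representation learning experiments use L2 Loss only
- In other words, plain regression
➤ During representation learning the network only ever sees images with missing regions
- Yet the target task feeds it images with nothing missing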
Cls. Det. Seg.
random 53.3 43.4 19.8
CE 56.5 44.5 29.7
JP 67.7 53.2 —
Figure 2: Context Encoder. The context image is passed
through the encoder to obtain features which are connected
to the decoder using channel-wise fully-connected layer as
Pathak et al., “Context encoders: Feature learning by inpainting ”, CVPR 2016.
(~ CVPR2017)
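A minimal sketch of the L2-only inpainting objective: mask a central square and measure the squared error on the masked pixels. The mask shape, fill value and function names are assumptions for illustration; the network that produces the prediction is omitted.

```python
def mask_centre(image, size, fill=0.0):
    """Return a copy of the image with a central size x size square replaced by fill."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    masked = [row[:] for row in image]
    region = []
    for r in range(top, top + size):
        for c in range(left, left + size):
            masked[r][c] = fill
            region.append((r, c))
    return masked, region

def l2_reconstruction_loss(prediction, image, region):
    """Mean squared error over the masked region only."""
    return sum((prediction[r][c] - image[r][c]) ** 2 for r, c in region) / len(region)

image = [[1.0] * 4 for _ in range(4)]
masked, region = mask_centre(image, size=2)
loss_if_nothing_filled = l2_reconstruction_loss(masked, image, region)
loss_if_perfect = l2_reconstruction_loss(image, image, region)
```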
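n Colorful Image Colorization (CC)
➤ Pretext task : colorizing grayscale images {L => ab}
➤ Solves a classification problem over a quantized ab space rather than a simple regression
➤ The representation is learned assuming grayscale input, which is a limitation when the inputs are color images
n Split-Brain (SB)
➤ Splits the network in two along the channel dimension and ensembles {L => ab, ab => L}
➤ Again quantizes and solves a classification problem rather than a regression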
Fig. 2. Our network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm [30] layer. The net has no pool layers. All changes in resolution are achieved through spatial downsampling or upsampling between conv blocks.
Cls. Det. Seg.
random 53.3 43.4 19.8
CC 65.9 46.9 35.6
SB 67.1 46.7 36.0
JP 67.7 53.2 —
[Figure 2: Split-Brain Autoencoders, panel (a) Lab Images: the input image X is split into the grayscale channel L and the color channels ab; each part is used to predict the other, and the predicted color channels and predicted grayscale channel together form the predicted image.]
Zhang et al., “Colorful Image Colorization”, ECCV 2016.
Zhang et al., “Split-brain autoencoders: Unsupervised learning by cross-channel prediction”, CVPR 2017.
(~ CVPR2017)
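Turning colorization into classification requires mapping each (a, b) value to a discrete bin. The sketch below assumes a uniform grid; the grid size of 10 and the range [-110, 110) are assumptions made for illustration, and restriction to in-gamut bins is not modeled.

```python
def ab_to_class(a, b, grid=10, lo=-110, hi=110):
    """Map an (a, b) colour value to the index of its grid cell."""
    bins = (hi - lo) // grid                               # cells per axis
    ia = max(0, min(int((a - lo) // grid), bins - 1))
    ib = max(0, min(int((b - lo) // grid), bins - 1))
    return ia * bins + ib

def class_to_ab(k, grid=10, lo=-110, hi=110):
    """Map a class index back to the centre of its grid cell."""
    bins = (hi - lo) // grid
    ia, ib = divmod(k, bins)
    return lo + (ia + 0.5) * grid, lo + (ib + 0.5) * grid

num_classes = ((110 - (-110)) // 10) ** 2   # 22 * 22 = 484 cells
k = ab_to_class(0, 0)
centre = class_to_ab(k)
```

The network then predicts a distribution over these classes per pixel, trained with cross entropy instead of a regression loss.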
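n DCGAN
➤ Pretext task : training an image generative model
- Proposes techniques that enable high-quality generation, mainly from an architectural viewpoint
- Modeling the data distribution well => the model captures good features
➤ Uses intermediate outputs of the discriminator as the representation
➤ No ImageNet => Pascal VOC experiments
➤ Compared with Exemplar CNN on CIFAR-10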
on CIFAR-10
acc. (%) Num of feature
Ex CNN 84.3 1024
DCGAN 82.8 512
(~ CVPR2017)
Under review as a conference paper at ICLR 2016
~ CVPR 2018
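n BiGAN
➤ Unlike an ordinary GAN, which models only 𝑝(𝑥|𝑧) (the Generator), BiGAN also models the inference of latent variables 𝑝(𝑧|𝑥) (the Encoder)
➤ The joint distribution given by the Generator, 𝑝_G(𝑥, 𝑧) = 𝑝_G(𝑥|𝑧) 𝑝(𝑧), and the joint distribution given by the Encoder, 𝑝_E(𝑥, 𝑧) = 𝑝_E(𝑧|𝑥) 𝑝(𝑥), are brought together within the same framework as an ordinary GAN
➤ Gives better results than an ordinary GAN that uses intermediate outputs of D as the feature representation
- D is not a general-purpose classifier of the data distribution versus everything else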
Cls. Det. Seg.
random 53.3 43.4 19.8
BiGAN 60.3 46.9 35.2
JP 67.7 53.2 —
(~ CVPR2017)
Published as a conference paper at ICLR 2017
[Diagram: z is mapped by G to G(z) and x is mapped by E to E(x); D receives the pairs (G(z), z) and (x, E(x)) and outputs P(y). z and E(x) are labeled "features"; G(z) and x are labeled "data".]
Figure 1: The structure of Bidirectional Generative Adversarial Networks (BiGAN).
n Learning to Count (LC)
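➤ Pretext task : learn features that satisfy the following constraint
➤ Constraint : feed each tile of the image and the original image into the same CNN; the output for the original image should equal the sum of the outputs for the tiles
=> This holds when each dimension of the output feature represents the amount of "some high-level primitive" in the image
➤ Personally, I find this a very interesting idea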
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP 67.7 53.2 —
Figure 3: Average response of our trained network on the ImageNet validation set. Despite its sparsity (30 non zero entries), the hidden representation in the trained network performs well when transferred to the classification, detection and segmentation tasks.
Method Ref Class. Det.
Supervised [20] [43] 79.9 56.8
Random [33] 53.3 43.4
Context [9] [19] 55.3 46.6
Context [9]∗ [19] 65.3 51.1
Jigsaw [30] [30] 67.6 53.2
ego-motion [1] [1] 52.9 41.8
ego-motion [1]∗ [1] 54.2 43.9
Adversarial [10]∗ [10] 58.6 46.2
ContextEncoder [33] [33] 56.5 44.5
Sound [31] [44] 54.4 44.0
Sound [31]∗ [44] 61.3 -
Video [41] [19] 62.8 47.4
Video [41]∗ [19] 63.1 47.2
Colorization [43]∗ [43] 65.9 46.9
Split-Brain [44]∗ [44] 67.1 46.7
ColorProxy [22] [22] 65.9 -
Noroozi et al., “Representation Learning by Learning to Count”, ICCV 2017.
(~ CVPR2018)
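A minimal sketch of the counting constraint with a hand-written "counting" feature standing in for the CNN. The contrastive term with a second image `y` and a margin is included as an assumption about how the trivial all-zero solution is ruled out; downsampling of the whole image is omitted, and all names and values are hypothetical.

```python
def split_into_tiles(image):
    """Split a 2D list into its four quadrants."""
    h, w = len(image), len(image[0])
    hh, hw = h // 2, w // 2
    return [[row[c:c + hw] for row in image[r:r + hh]] for r in (0, hh) for c in (0, hw)]

def count_feature(image):
    """Toy feature: how many cells hold primitive 1 and how many hold primitive 2."""
    flat = [v for row in image for v in row]
    return [flat.count(1), flat.count(2)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def counting_loss(x, y, feature, margin=10.0):
    """Whole-image count should match the sum of tile counts, but not another image's."""
    tile_sum = [sum(vals) for vals in zip(*(feature(t) for t in split_into_tiles(x)))]
    same = sq_dist(feature(x), tile_sum)
    other = sq_dist(feature(y), tile_sum)
    return same + max(0.0, margin - other)

x = [[1, 0, 2, 0], [0, 1, 0, 0], [2, 2, 0, 1], [0, 0, 0, 1]]
y = [[0] * 4 for _ in range(4)]
loss_counting = counting_loss(x, y, count_feature)
loss_trivial = counting_loss(x, y, lambda img: [0, 0])
```

The toy feature satisfies the constraint exactly, while the constant feature is penalized by the margin term.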
n Noise as target (NAT)
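➤ Pretext task : bring the output for each image close to target vectors sampled uniformly
- Ideally, targets are assigned so that the sum of errors over all samples is minimized
- An exhaustive search is too costly, so targets are assigned approximately per batch with the Hungarian algorithm
➤ Seems pointless at first glance, but spreading the image feature vectors uniformly over the feature space apparently has a meaning (see Appendix)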
Cls. Det. Seg.
random 53.3 43.4 19.8
NAT 65.3 49.4 36.6
JP 67.7 53.2 —
Bojanowski et al., “Unsupervised Learning by Predicting Noise”, ICML 2017.
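The per-batch assignment step can be illustrated on a tiny batch. The sketch below finds the optimal assignment by brute force over permutations, which stands in for the Hungarian algorithm and is only feasible for very small batches; `assign_targets` is a hypothetical name.

```python
import itertools

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def assign_targets(features, targets):
    """Match features[i] to targets[perm[i]] so that the total squared distance is minimal."""
    n = len(features)
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):   # brute force, O(n!)
        cost = sum(sq_dist(features[i], targets[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

features = [(1, 0), (0, 1), (-1, 0)]
targets = [(0, 1), (-1, 0), (1, 0)]
perm, cost = assign_targets(features, targets)
```

After the assignment, the network would be updated to move each feature toward its assigned target.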
[Figure 1 (caption truncated in the slide): diagram labels are Images, Features, Assignment, and Target space.]
Choosing the loss function. In the supervised setting, a popular choice for the loss ℓ is the softmax function. However, computing this loss is linear in the number of targets, making it impractical for large output spaces (Goodman, 2001). While there are workarounds to scale these losses to large output spaces, Tygert et al. (2017) has recently shown that using a squared ℓ2 distance works well in many supervised settings, as long as the final activations are unit normalized. This loss only requires access to a single target per sample, making its computation independent of the number of targets. This leads to the following problem:
\min_{\theta} \min_{Y \in \mathbb{R}^{n \times d}} \frac{1}{2n} \| f_\theta(X) - Y \|_F^2 \qquad (2)
where f_\theta(X) still denotes the unit normalized features.
Figure 3. Images and their 3 nearest neighbors in ImageNet according to our model using an ℓ2 distance. The query images are shown on the top row, and the nearest neighbors are sorted from the closer to the further. Our features seem to capture global distinctive structures.
Figure 4. Filters from the first layer of an AlexNet trained on ImageNet with supervision (left) or with NAT (right). The filters are in grayscale, since we use grayscale gradient images as input.
4.2. Comparison with the state of the art
We report results on the transfer task both on ImageNet and PASCAL VOC 2007. In both cases, the model is trained on ImageNet classification. In this experiment, we evaluate the quality of our features for the object classification task of ImageNet. Note that in this setup, we build the unsupervised features on images that correspond to predefined image categories. Even though we do not have access to category labels, the data itself is biased towards these classes. In order to evaluate the features, we freeze the layers up to the last convolutional layer and train the classifier with supervision. This experimental setting follows Noroozi & Favaro (2016).
(~ CVPR2018)
n Instance Discrimination (ID)
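➤ Pretext task : a classification problem in which every image instance is its own class
- In practice the number of classes is enormous, so NCE is used
- Minimizes the cross entropy whose logits are the inner products between the input image's feature and each image's feature from the previous iteration
➤ At the optimum, the embedding should scatter the feature vectors of the images sparsely over the hypersphere (see Appendix)
=> This should amount to something very close to NAT (which the paper does not cite)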
Cls. Det. Seg.
random 53.3 43.4 19.8
ID — 48.1 —
JP 67.7 53.2 —
Wu et al., “Unsupervised Feature Learning via Non-Parametric Instance Discrimination ”, CVPR 2018.
[Diagram: the 1st to n-th images pass through a CNN backbone; the output is reduced to a low dimension and L2-normalized.]
(~ CVPR2018)
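A minimal sketch of the instance-level cross entropy with a memory bank of features from the previous iteration. It computes the full softmax rather than the NCE approximation, and the temperature value and function names are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def instance_loss(feature, index, memory_bank, temperature=0.07):
    """Cross entropy where logit j is the inner product with stored feature j."""
    f = l2_normalize(feature)
    logits = [sum(a * b for a, b in zip(f, m)) / temperature for m in memory_bank]
    mx = max(logits)                                  # for numerical stability
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_z - logits[index]

# Features stored for three image instances in the previous iteration.
memory_bank = [l2_normalize(v) for v in ([1, 0], [0, 1], [-1, 0])]
loss_correct = instance_loss([2, 0], 0, memory_bank)
loss_wrong = instance_loss([2, 0], 1, memory_bank)
```

The loss is small when the current feature is close to its own stored feature and far from those of all other instances.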
n Spot Artifact (SA)
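➤ Pretext task : inpainting images that were corrupted on the feature map
- Adversarial training between repair layers, which fill in the dropped entries, and a discriminator
- Repair layers are inserted into a model trained beforehand as an autoencoder
- The discriminator is expected to acquire a good feature representation
➤ Dropping entries on the feature map is expected to remove higher-level information (though this is hard to see in the actual corrupted images)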
Cls. Det. Seg.
random 53.3 43.4 19.8
SA 69.8 52.5 38.1
JP 67.7 53.2 —
Figure 2. The proposed architecture. Two autoencoders {E, D1, D2, D3, D4, D5} output either real images (top row) or images with
artifacts (bottom row). A discriminator C is trained to distinguish them. The corrupted images are generated by masking the encoded
feature φ(x) and then by using a repair network {R1, R2, R3, R4, R5} distributed across the layers of the decoder. The mask is also used
by the repair network to change only the dropped entries of the feature (see Figure 5 for more details). The discriminator and the repair
network (both shaded in blue) are trained in an adversarial fashion on the real/corrupt classification loss. The discriminator is also trained
to output the mask used to drop feature entries, so that it learns to localize all artifacts.
Insert repair layers
Jenni et al., “Self-Supervised Feature Learning by Learning to Spot Artifacts”, CVPR 2018.
(~ CVPR2018)
n Jigsaw Puzzle++
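➤ Pretext task : JP in which 1 to 3 patches are replaced with patches from another image
- Fewer patches are visible, and patches from the other image have to be identified
- Both make the pretext task harder
- Permutations are selected with Hamming distance in mind so that no puzzle belongs to more than one class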
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP++ 69.8 55.5 38.1
JP 67.7 53.2 —
Figure 3: The Jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation
Noroozi et al., “Boosting Self-Supervised Learning via Knowledge Transfer”, CVPR 2018.
(~ CVPR2018)
n Classify Rotation (CR)
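➤ Pretext task : estimating the rotation of an image
- 4-class classification of 0°, 90°, 180°, 270°
- Finer-grained angles would require interpolation after rotation
=> this creates artifacts, which lead to trivial solutions
➤ Estimating the rotation angle of an object requires high-level information about the object
➤ Highest accuracy so far (Cls., Seg.) & the simplest to implement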
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
Published as a conference paper at ICLR 2018
[Diagram: the image X is rotated by 0, 90, 180 and 270 degrees, g(X, y) for y = 0, 1, 2, 3; each rotated image is passed through the same ConvNet model F(.), which is trained to maximize the probability of the applied rotation y.]
Figure 2: Illustration of the self-supervised task that we propose for semantic feature learning. Given four possible geometric transformations, the 0, 90, 180, and 270 degrees rotations, we train a ConvNet model F(.) to recognize the rotation that is applied to the image that it gets as input.
Gidaris et al., “Unsupervised Representation Learning by predicting Image Rotation”, ICLR 2018.
(~ CVPR2018)
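Because only multiples of 90° are used, the rotated copies can be produced by transposing and flipping, with no interpolation. A minimal sketch on 2D lists, with hypothetical function names:

```python
def rot90(image):
    """Rotate a 2D list by 90 degrees counter-clockwise (transpose, then flip rows)."""
    return [list(row) for row in zip(*image)][::-1]

def rotation_examples(image):
    """Return the four (rotated_image, label) pairs with label y meaning y * 90 degrees."""
    examples = []
    current = [row[:] for row in image]
    for label in range(4):
        examples.append((current, label))
        current = rot90(current)
    return examples

examples = rotation_examples([[1, 2], [3, 4]])
```

A 4-way classifier is then trained to predict `label` from the rotated image.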
n Classify Rotation (CR)
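➤ Dependence on data structure
➤ In some image domains, might rotation be estimated from low-level features?
- Indeed, results on the Places scene classification task are less impressive
➤ There should also be images for which rotation cannot be defined
- e.g. aerial photographs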
Gidaris et al., “Unsupervised Representation Learning by predicting Image Rotation”, ICLR 2018.
Random 11.6 17.1 16.9 16.3 14.1
Random rescaled Krähenbühl et al. (2015) 17.5 23.0 24.5 23.2 20.6
Context (Doersch et al., 2015) 16.2 23.3 30.2 31.7 29.6
Context Encoders (Pathak et al., 2016b) 14.1 20.7 21.0 19.8 15.5
Colorization (Zhang et al., 2016a) 12.5 24.5 30.4 31.5 30.3
Jigsaw Puzzles (Noroozi & Favaro, 2016) 18.2 28.8 34.0 33.9 27.1
BIGAN (Donahue et al., 2016) 17.7 24.5 31.0 29.9 28.0
Split-Brain (Zhang et al., 2016b) 17.7 29.3 35.4 35.2 32.8
Counting (Noroozi et al., 2017) 18.0 30.6 34.3 32.5 25.7
(Ours) RotNet 18.8 31.7 38.7 38.2 36.5
Table 6: Task & Dataset Generalization: Places top-1 classification with linear layers. We compare our unsupervised feature learning approach with other unsupervised approaches by training logistic regression classifiers on top of the feature maps of each layer to perform the 205-way Places classification task (Zhou et al., 2014). All unsupervised methods are pre-trained (in an unsupervised way) on ImageNet. All weights are frozen and feature maps are spatially resized (with adaptive max pooling) so as to have around 9000 elements. All approaches use AlexNet variants and were pre-trained on ImageNet without labels except the Place labels, ImageNet labels, and Random entries.
Method Conv1 Conv2 Conv3 Conv4 Conv5
Places labels Zhou et al. (2014) 22.1 35.1 40.2 43.3 44.6
ImageNet labels 22.7 34.8 38.4 39.4 38.7
Random 15.7 20.3 19.8 19.1 17.5
Random rescaled Krähenbühl et al. (2015) 21.4 26.2 27.1 26.1 24.0
Context (Doersch et al., 2015) 19.7 26.7 31.9 32.7 30.9
Context Encoders (Pathak et al., 2016b) 18.2 23.2 23.4 21.9 18.4
Colorization (Zhang et al., 2016a) 16.0 25.7 29.6 30.3 29.7
Jigsaw Puzzles (Noroozi & Favaro, 2016) 23.0 31.9 35.0 34.2 29.3
BIGAN (Donahue et al., 2016) 22.0 28.7 31.8 31.3 29.7
Split-Brain (Zhang et al., 2016b) 21.3 30.7 34.0 34.1 32.5
Counting (Noroozi et al., 2017) 23.3 33.9 36.3 34.7 29.6
(Ours) RotNet 21.5 31.0 35.1 34.6 33.7
Implementation details: For those experiments we implemented our RotNet model with an AlexNet architecture.
(~ CVPR2018)
Method Conference Classification Detection Segmentation
Random init. — 53.3 43.4 19.8
Context prediction ICCV15 55.3 46.6 —
Context encoder CVPR16 56.5 44.5 29.7
Colorize ECCV16 65.9 46.9 35.6
Jigsaw ECCV16 67.7 53.2 —
Split-Brain CVPR17 67.1 46.7 36.0
NAT ICML17 65.3 49.4 36.6
Counting ICCV17 67.7 51.4 36.6
BiGAN ICLR17 60.1 46.9 34.9
Rotation ICLR18 73.0 54.4 39.1
Spot Artifact CVPR18 69.8 52.5 38.1
Instance Dis. CVPR18 — 48.1 —
Jigsaw++ CVPR18 69.8 55.5 38.1
Supervised — 79.9 59.1 48.0
{Self, Un}-supervised learning on ImageNet => Fine-tuning on Pascal VOC2007
{Self, Un}-supervised learning on ImageNet => Fine-tuning on Pascal VOC2007
Method Conference Classification Detection Segmentation
Random init. — 53.3 43.4 19.8
Context prediction ICCV15 55.3 46.6 —
Context encoder CVPR16 56.5 44.5 29.7
Colorize ECCV16 65.9 46.9 35.6
Jigsaw ECCV16 67.7 53.2 —
Split-Brain CVPR17 67.1 46.7 36.0
NAT ICML17 65.3 49.4 36.6
Counting ICCV17 67.7 51.4 36.6
BiGAN ICLR17 60.1 46.9 34.9
Rotation ICLR18 73.0 54.4 39.1
Spot Artifact CVPR18 69.8 52.5 38.1
Instance Dis. CVPR18 — 48.1 —
Jigsaw++ CVPR18 69.8 55.5 38.1
Deep Cluster ECCV18 73.7 55.4 45.1
Supervised — 79.9 59.1 48.0
n Deep Cluster (DC)
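➤ Repeat the following steps
① k-means clustering based on intermediate features of the CNN
② Train a classification problem using the assigned clusters as pseudo labels
➤ In the first iteration, clustering is based on the output of a randomly initialized CNN
- Even an MLP trained on that output reaches 12%
=> input information is preserved to some extent
➤ In the ImageNet experiments, k = 10000 (> 1000) works best
➤ A simple and very powerful method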
Caron et al., “Deep Clustering for Unsupervised Learning of Visual Features ”, ECCV 2018.
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
DC 73.7 55.4 45.1
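One round of the two alternating steps can be sketched as follows, with a small pure-Python k-means and a user-supplied feature extractor. The classifier training of step ② is omitted; `kmeans_labels`, `deepcluster_round` and the toy data are hypothetical.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_labels(features, k, iters=20, seed=0):
    """Plain k-means; returns the cluster index of every feature vector."""
    rng = random.Random(seed)
    centres = rng.sample(features, k)
    labels = [0] * len(features)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(f, centres[j])) for f in features]
        for j in range(k):
            members = [f for f, l in zip(features, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def deepcluster_round(images, extract, k):
    """Step 1: cluster the current features. Step 2 would train on the pseudo labels."""
    features = [extract(img) for img in images]
    pseudo_labels = kmeans_labels(features, k)
    return list(zip(images, pseudo_labels))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
pairs = deepcluster_round(points, extract=lambda p: p, k=2)
pseudo = [label for _, label in pairs]
```

In the full method the extractor is the CNN itself, so the features, and therefore the clusters, change after every training round.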
[Figure panels: (a) Clustering quality, (b) Cluster reassignment, (c) Influence of k]
Between ImageNet labels and the clusters
=> cluster assignment stabilizes
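n Deep InfoMax (DIM)
➤ Trained to maximize the mutual information 𝐼(𝒙; 𝒛) between the input 𝒙 and the feature vector 𝒛
- Put simply, it increases the dependence between 𝒙 and 𝒛
- In practice, maximizing the mutual information between 𝒛 and each patch of 𝒙 has a large effect
➤ Simply attaching a discriminator that classifies (𝒙, 𝒛) pairs as positive or negative and training end-to-end maximizes a lower bound on 𝐼(𝒙; 𝒛)
➤ No alternating optimization as in GANs, so implementation and training are easy
➤ Not compared against every method, but accuracy is close to supervised learning
Hjelm et al., “Learning deep representations by mutual information estimation and maximization”, arXiv 8/2018.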
Figure 1: The base encoder model in the context of image data. An image (in this case) is encoded into a convolutional network until reaching a feature map of M × M feature vectors corresponding to M × M input patches. These vectors are summarized (for instance, using additional convolutions and fully-connected layers) into a single feature vector, Y. Our goal is to train this network such that relevant information about the input is extractable from the high-level features.
Figure 2: Deep INFOMAX (DIM) with a global MI(X; Y) objective. Here, we pass both the high-level feature vector, Y, and the lower-level M × M feature map (See Figure 1) through a discriminator composed of additional convolutions, flattening, and fully-connected layers to get the score. Fake samples are drawn by combining the same feature vector with a M × M feature map from another image.
Table 2: Classification accuracy (top 1) results on Tiny Image
DIM with the local objective outperforms all other models presen
accuracy of a fully-supervised classifier with similar with the A
Tiny ImageNet STL
conv fc (4096) Y (64) conv
Fully supervised 36.60
VAE 18.63 16.88 11.93 58.27
AAE 18.04 17.27 11.49 59.54
BiGAN 24.38 20.21 13.06 71.53
NAT 13.70 11.62 1.20 64.32
DIM(G) 11.32 6.34 4.95 42.03
DIM(L) 33.8 34.5 30.7 71.82
Accuracy close to supervised learning on Tiny ImageNet
n Contrastive Predictive Coding (CPC)
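➤ For sequential data, maximizes the mutual information between the feature vector 𝑐_t at a given time step and a future input 𝑥_{t+k}
➤ Here the discriminator identifies the single positive pair among N pairs
➤ For images, the feature map is treated as a sequence running from top to bottom, as in the figure
➤ Not compared against every method, but by far the best accuracy within its own experiments
Oord et al., “Representation Learning with Contrastive Predictive Coding”, arXiv 6/2018.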
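Identifying one positive pair among N candidates is a softmax classification over pair scores. The sketch below uses a plain dot product as the score, which is a simplifying assumption, and `info_nce` is a hypothetical name. It also evaluates log N minus the loss, the quantity that serves as a lower bound on the mutual information in this formulation.

```python
import math

def info_nce(context, candidates, positive_index):
    """Negative log-probability of picking the positive among the N candidates."""
    scores = [sum(a * b for a, b in zip(context, x)) for x in candidates]
    mx = max(scores)                                   # for numerical stability
    log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return log_z - scores[positive_index]

context = [1.0, 0.0]
candidates = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0], [0.0, -5.0]]  # index 0 is the positive
loss = info_nce(context, candidates, 0)
mi_lower_bound = math.log(len(candidates)) - loss
uninformative = info_nce([0.0, 0.0], candidates, 0)   # all scores equal
```

When the context carries no information, the loss equals log N and the bound is zero.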
[Diagram: a 256 px input image is divided into 64 px patches with 50% overlap; labels: genc - output, gar - output.]
Figure 4: Visualization of Contrastive Predictive Coding for images (2D adaptation of Figure 1).
Method Top-1 ACC
Using AlexNet conv5
Video [27] 29.8
Relative Position [11] 30.4
BiGan [34] 34.8
Colorization [10] 35.2
Jigsaw [28] * 38.1
Using ResNet-V2
Motion Segmentation [35] 27.6
Exemplar [35] 31.5
Relative Position [35] 36.2
Colorization [35] 39.6
CPC 48.7
Table 3: ImageNet top-1 unsupervised classification results. *Jigsaw is not directly comparable to the other AlexNet results because of architectural differences.
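n Up to CVPR2018
➤ A wide variety of idea-driven methods were published (and surely many more were shelved)
➤ Self-supervised learning that exploits the data structure of images was dominant (Rotation, …)
n Current trends
➤ Methods that do not depend on data structure have started to work (Deep Cluster, mutual-information-based methods)
➤ Whether methods that depend on data structure work well appears to depend on the image domain (see the rotation results on Places)
n Outlook
➤ Methods
- Methods that do not depend on data structure will develop further (though it is hard to picture concretely how)
➤ The research area
- Beating supervised learning (surpassing ImageNet pretraining)
- Task-specific unsupervised learning (it already exists, but…)
This direction seems a better fit for self-supervised learning that exploits data structure
n What is fun
➤ Being able to train without annotation as long as data is available is an exciting prospect
➤ Designing a pretext task while thinking about the data structure is (also) fun
n What is painful
➤ You basically cannot tell until you try (only experimental results show what is good or bad)
➤ Evaluation requires tuning twice (pretext and target)
n In practice
➤ The prevailing attitude is that an ImageNet pretrained model is all you need as a pretrained model
➤ However, ImageNet pretrained models are sometimes not effective
- When the image domain differs greatly from ImageNet
➤ Under such conditions there seems to be a use for these methods
➤ Depending on the conditions, they may compete with semi-supervised learning
- Unlabeled data + labeled data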
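n InfoMax principle [Barber+, 2003]
➤ A good representation 𝑓_𝜃(𝑥) = 𝑧 of the data 𝑥 is obtained when 𝜃 = 𝜃_InfoMax
\theta_{\mathrm{InfoMax}} = \operatorname*{argmax}_{\theta \,:\, \mathbb{H}(z) \le c} \mathbb{I}(x, z)
𝕀(⋅) : mutual information, ℍ(⋅) : entropy
➤ Maximize the mutual information between 𝑥 and 𝑧 (retain as much information about 𝑥 as possible) under the constraint that the entropy of the marginal distribution of 𝑧 is bounded above by a constant (written c here)
➤ The mutual information can be written as
\mathbb{I}(x, z) = \mathbb{H}(z) - \mathbb{H}(z \mid x)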
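The decomposition 𝕀(𝑥, 𝑧) = ℍ(𝑧) − ℍ(𝑧|𝑥) can be checked numerically on a small discrete joint distribution. The two tables below are made-up examples: one where 𝑧 is a deterministic function of 𝑥, and one where 𝑥 and 𝑧 are independent.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """joint[i][j] = p(x=i, z=j). Returns (H(z), H(z|x), I(x, z)) in bits."""
    pz = [sum(col) for col in zip(*joint)]
    h_z = entropy(pz)
    h_z_given_x = 0.0
    for row in joint:
        px = sum(row)
        if px > 0:
            h_z_given_x += px * entropy([p / px for p in row])
    return h_z, h_z_given_x, h_z - h_z_given_x

deterministic = mutual_information([[0.5, 0.0], [0.0, 0.5]])
independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
```

In the deterministic case ℍ(𝑧|𝑥) is zero, so increasing ℍ(𝑧) is the only way to increase the mutual information.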
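n The case of Noise as target (NAT)
➤ 𝑓_𝜃(𝑥) = 𝑧 is a deterministic function, so ℍ(𝑧|𝑥) is constant
➤ The set of 𝑧 is pushed toward a set of samples that is uniform on the hypersphere
=> this increases ℍ(𝑧)
➤ 𝑧 is restricted to the unit hypersphere in Euclidean space, so the entropy of the uniform distribution on the hypersphere is the upper bound of ℍ(𝑧)
=> the entropy of the marginal distribution of 𝑧 is bounded above by a constant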
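n The case of Instance discrimination (ID)
➤ Performs instance-level classification using NCE
➤ Read alongside the CPC paper, it can be seen as maximizing the mutual information between input and features
- The feature extractor and the discriminator share all of their parameters, and the discriminator …
The paper hardly discusses the detailed relation to the InfoMax principle, but …
n Deep InfoMax (DIM)
➤ Explicitly maximizes the mutual information between input and features
➤ In the experiments, it is maximized between partial patches of the image and the feature of the whole image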
n Contrastive Predictive Coding (CPC)
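➤ Explicitly maximizes the mutual information between input and features
➤ Maximizes the mutual information between the sequence so far and the future sequence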

