13. cvpaper.challenge 13
■ Exemplar CNN
➤ Pretext task: instance-level image classification that is robust to (geometric and color) transformations
➤ Since (number of classes = number of training image instances) and classification is done with a plain Softmax, the size of the dataset that can be used does not scale well
➤ In fact it does something close to Instance Discrimination (covered later), already back in 2014
➤ Better results than SIFT on tasks such as geometric matching (see the sketch below)
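A minimal PyTorch-style sketch of the surrogate-class idea (my own illustration, not the authors' code; names such as SurrogateDataset and seed_patches are made up): every seed patch becomes its own class, and random transformations of it become the training samples of that class.

    import torch
    import torchvision.transforms as T

    augment = T.Compose([
        T.RandomResizedCrop(32, scale=(0.7, 1.0)),  # random translation / scale
        T.RandomHorizontalFlip(),
        T.ColorJitter(0.4, 0.4, 0.4, 0.1),          # color / contrast perturbations
    ])

    class SurrogateDataset(torch.utils.data.Dataset):
        """Label of a sample = index of the seed patch it was generated from."""
        def __init__(self, seed_patches, samples_per_class=100):
            self.seed_patches = seed_patches          # list of PIL seed patches
            self.samples_per_class = samples_per_class

        def __len__(self):
            return len(self.seed_patches) * self.samples_per_class

        def __getitem__(self, idx):
            cls = idx % len(self.seed_patches)        # surrogate class = instance id
            return T.ToTensor()(augment(self.seed_patches[cls])), cls

Training is then an ordinary N-way softmax classification with N = number of instances, which is exactly why the approach does not scale to very large image collections.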
(Right off the bat) Others
Dosovitskiy et al., “Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks”, NIPS 2014.
(~ CVPR2017)
Fig. 2. Several random transformations applied to one of the patches extracted from the STL unlabeled dataset. The original ('seed') patch is in the top left corner.
One image instance after various transformations; this is defined as a single class.
TABLE 1. Classification accuracies on several datasets (in percent). * Average per-class accuracy 78.0% ± 0.4%. † Average per-class accuracy 85.0% ± 0.7%. ‡ Average per-class accuracy 85.8% ± 0.7%.

Algorithm | STL-10 | CIFAR-10(400) | CIFAR-10 | Caltech-101 | Caltech-256(30) | #features
Convolutional K-means Network [32] | 60.1 ± 1 | 70.7 ± 0.7 | 82.0 | — | — | 8000
Multi-way local pooling [33] | — | — | — | 77.3 ± 0.6 | 41.7 | 1024 × 64
Slowness on videos [14] | 61.0 | — | — | 74.6 | — | 556
Hierarchical Matching Pursuit (HMP) [34] | 64.5 ± 1 | — | — | — | — | 1000
Multipath HMP [35] | — | — | — | 82.5 ± 0.5 | 50.7 | 5000
View-Invariant K-means [16] | 63.7 | 72.6 ± 0.7 | 81.9 | — | — | 6400
Exemplar-CNN (64c5-64c5-128f) | 67.1 ± 0.2 | 69.7 ± 0.3 | 76.5 | 79.8 ± 0.5* | 42.4 ± 0.3 | 256
Exemplar-CNN (64c5-128c5-256c5-512f) | 72.8 ± 0.4 | 75.4 ± 0.2 | 82.2 | 86.1 ± 0.5† | 51.2 ± 0.2 | 960
Exemplar-CNN (92c5-256c5-512c5-1024f) | 74.2 ± 0.4 | 76.6 ± 0.2 | 84.3 | 87.1 ± 0.7‡ | 53.6 ± 0.2 | 1884
Supervised state of the art | 70.1 [36] | — | 92.0 [37] | 91.44 [38] | 70.6 [2] | —
4.3 Detailed Analysis
We performed additional experiments using the 64c5-64c5-128f network to study the effect of various design choices in Exemplar-CNN training and validate the invariance properties of the learned features.

4.3.1 Number of Surrogate Classes
We varied the number N of surrogate classes between 50 and 32000. As a sanity check, we also tried classification with random filters. The results are shown in Fig. 3. Clearly, the classification accuracy increases with the number of surrogate classes until it reaches an optimum at about 8000 surrogate classes after which it did not change or even decreased. This is to be expected: the larger the number of surrogate classes, the more likely it is to draw very similar or even identical samples, which are hard or impossible to discriminate. Few such cases are not detrimental to the …
[Figure 3 from the paper: classification accuracy on STL-10 (± σ) and validation error on the surrogate data, plotted against the number of classes (log scale), from 50 up to 32000.]
Fig. 3. Influence of the number of surrogate training classes. The validation error on the surrogate data is shown in red. Note the different y-axes for the two curves.
The number of classes (= number of image instances) hits its limit at around 8000.
14. cvpaper.challenge 14
■ Context Prediction (CP)
➤ Pretext task: split the image into a 3×3 grid and classify the relative position of two patches (8 classes)
- The two patches are fed into a Siamese network whose branches share weights
- The branch CNN is used as the pretrained model
➤ Fine-tuning results are only slightly better than random initialization (pair-sampling sketch below)
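A rough sketch of how a training pair could be sampled (an assumed helper, not the authors' code; patch size, gap and jitter follow the values mentioned in the paper):

    import random

    # 8 neighbour offsets (row, col) of the second patch relative to the first
    OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

    def sample_pair(image, patch=96, gap=48, jitter=7):
        """image: CxHxW array, large enough (roughly 400 px a side) for a 3x3 grid."""
        _, H, W = image.shape
        cell = patch + gap                              # spacing between patch origins
        lo = cell + jitter
        y = random.randint(lo, H - patch - cell - jitter - 1)
        x = random.randint(lo, W - patch - cell - jitter - 1)
        label = random.randrange(8)                     # which neighbour was chosen
        dy, dx = OFFSETS[label]
        # jitter the neighbour so its exact location cannot be used as a shortcut
        y2 = y + dy * cell + random.randint(-jitter, jitter)
        x2 = x + dx * cell + random.randint(-jitter, jitter)
        return (image[:, y:y + patch, x:x + patch],
                image[:, y2:y2 + patch, x2:x2 + patch],
                label)

Both patches then go through the same weight-shared branch, and the concatenated features feed fully connected layers ending in an 8-way softmax.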
…cover clusters of, say, foliage. A few subsequent works have attempted to use representations more closely tied to shape [36, 43], but relied on contour extraction, which is difficult in complex images. Many other approaches [22, 29, 16] focus on defining similarity metrics which can be used in more standard clustering algorithms; [45], for instance, re-casts the problem as frequent itemset mining. Geometry may also be used for verifying links between images [44, 6, 23], although this can fail for deformable objects.

Video can provide another cue for representation learning. For most scenes, the identity of objects remains unchanged even as appearance changes with time. This kind of temporal coherence has a long history in visual learning literature [18, 59], and contemporaneous work shows strong improvements on modern detection datasets [57].

Finally, our work is related to a line of research on discriminative patch mining [13, 50, 28, 37, 52, 11], which has emphasized weak supervision as a means of object discovery. Like the current work, they emphasize the utility of learning representations of patches (i.e. object parts) before learning full objects and scenes, and argue that scene-level labels can serve as a pretext task. For example, [13] trains detectors to be sensitive to different geographic locales, but the actual goal is to discover specific elements of architectural style.
3. Learning Visual Context Prediction
[Figure 3 from the paper: the pair-classification architecture. Patch 1 and Patch 2 each go through an identical branch: conv1 (11x11,96,4), LRN1, pool1 (3x3,96,2), conv2 (5x5,384,2), LRN2, pool2 (3x3,384,2), conv3 (3x3,384,1), conv4 (3x3,384,1), conv5 (3x3,256,1), pool5 (3x3,256,2), fc6 (4096); the branches are merged by fc7 (4096), fc8 (4096) and fc9 (8).]
Figure 3. Our architecture for pair classification. Dotted lines indicate shared weights. 'conv' stands for a convolution layer, 'fc' stands for a fully-connected one, 'pool' is a max-pooling layer, and 'LRN' is a local response normalization layer. Numbers in parentheses are kernel size, number of outputs, and stride (fc layers have only a number of outputs). The LRN parameters follow [32]. All conv and fc layers are followed by ReLU nonlinearities, except fc9 which feeds into a softmax classifier.
…semantic reasoning for each patch separately. When designing the network, we followed AlexNet where possible. To obtain training examples given an image, we sample the first patch uniformly, without any reference to image …
SiameseNet
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
[Figure 2 from the paper: the eight possible positions, numbered 1-8, around the centre patch; the example shown is labeled Y = 3.]
Figure 2. The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled.

…model (e.g. a deep network) to predict, from a single word, the n preceding and n succeeding words. In principle, similar reasoning could be applied in the image domain, a kind of visual "fill in the blank" task, but, again, one runs into the …
Fine-tuning on Pascal VOC
Classification-based
Doersch et al., “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015.
(~ CVPR2017)
15. cvpaper.challenge 15
■ Jigsaw Puzzle (JP)
➤ Pretext task: feed the patches in a random order and classify which permutation was applied
- The nine patches are fed into a Siamese network simultaneously
- The number of permutations is enormous, so training uses 1000 permutation classes chosen so that their Hamming distances are large (selection sketch below)
➤ CP can be quite ambiguous depending on the patches (figure below)
➤ The more patches the network can see, the less the ambiguity
➤ Accuracy improves considerably compared with CP
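A small sketch of how a permutation set with large pairwise Hamming distances can be chosen greedily (my simplification: the candidates are drawn at random instead of enumerating all 9! permutations):

    import random

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    def select_permutations(n_classes=100, n_tiles=9, n_candidates=2000, seed=0):
        rng = random.Random(seed)
        tiles = list(range(n_tiles))
        candidates = [tuple(rng.sample(tiles, n_tiles)) for _ in range(n_candidates)]
        chosen = [tuple(tiles)]                  # start from the identity permutation
        while len(chosen) < n_classes:
            # keep the candidate farthest (in min Hamming distance) from the chosen set
            best = max(candidates, key=lambda p: min(hamming(p, c) for c in chosen))
            chosen.append(best)
            candidates.remove(best)
        return chosen

    perms = select_permutations(n_classes=100)   # the paper trains with 1000 classes

During training the nine tiles are shuffled with perms[k] and the network has to predict k.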
Classification-based
Cls. Det. Seg.
random 53.3 43.4 19.8
CP 55.3 46.6 —
JP 67.7 53.2 —
Estimating the positions of ① and ② relative to ⑤ is quite difficult.
[Figure from the paper (Noroozi & Favaro): learning representations by solving Jigsaw puzzles. (a) An image from which tiles (marked with green lines) are extracted; (b) a shuffled puzzle made from those tiles. The slide highlights tiles ①, ② and ⑤.]
Noroozi et al., “Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles”, ECCV 2016.
(~ CVPR2017)
16. cvpaper.challenge 16
■ Solving the pretext task without needing high-level information
➤ What we actually want the network to capture, however, is high-level (semantic) information
➤ Can the relative position be estimated from low-level information at the patch boundaries alone?
- Put a gap between patches
- Jitter the patch positions
➤ Can the relative position be estimated from chromatic aberration?
- Randomly replace two channels with Gaussian noise (sketch of both countermeasures below)
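A sketch of both countermeasures (illustrative helpers, not the original implementation): crop each tile with a gap and positional jitter, and replace two of the three color channels with Gaussian noise so that chromatic-aberration cues disappear.

    import random
    import torch

    def crop_tile(image, top, left, tile=64, gap=16, jitter=7):
        """Crop one tile from the grid cell at (top, left), leaving a gap and jittering."""
        y = top + gap // 2 + random.randint(-jitter, jitter)
        x = left + gap // 2 + random.randint(-jitter, jitter)
        return image[:, y:y + tile, x:x + tile]

    def drop_color_channels(image, sigma=0.05):
        """Keep one random channel, replace the other two with Gaussian noise."""
        keep = random.randrange(3)
        out = torch.randn_like(image) * sigma
        out[keep] = image[keep]
        return out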
trivial solution
…occur in a specific spatial configuration (if there is no specific configuration of the parts, then it is "stuff" [1]). We present a ConvNet-based approach to learn a visual representation from this task. We demonstrate that the resulting visual representation is good for both object detection, providing a significant boost on PASCAL VOC 2007 compared to learning from scratch, as well as for unsupervised object discovery / visual data mining. This means, surprisingly, that our representation generalizes across images, despite being trained using an objective function that operates on a single image at a time. That is, instance-level supervision appears to improve performance on category-level tasks.

2. Related Work
One way to think of a good image representation is as the latent variables of an appropriate generative model. An ideal generative model of natural images would both generate images according to their natural distribution, and be concise in the sense that it would seek common causes for different images and share information between them. However, inferring the latent structure given an image is intractable for even relatively simple models. To deal with these computational issues, a number of works, such as the wake-sleep algorithm [25], contrastive divergence [24], deep Boltzmann machines [48], and variational Bayesian methods [30, 46] use sampling to perform approximate inference. Generative models have shown promising performance on smaller datasets such as handwritten digits [25, 24, 48, 30, 46], but none have proven effective for …
[Figure 2 from the paper: the eight possible relative positions, numbered 1-8, around the centre patch.]
Figure 2. The algorithm receives two patches in one of these eight possible spatial arrangements, without any context, and must then classify which configuration was sampled.

…model (e.g. a deep network) to predict, from a single word, the n preceding and n succeeding words. In principle, similar reasoning could be applied in the image domain, a kind of visual "fill in the blank" task, but, again, one runs into the problem of determining whether the predictions themselves are correct [12], unless one cares about predicting only very low-level features [14, 33, 53]. To address this, [39] predicts the appearance of an image region by consensus voting of the transitive nearest neighbors of its surrounding regions. Our previous work [12] explicitly formulates a statistical …
For example…
An example of chromatic aberration
Make the inter-channel "aberration" unavailable during training
Make it impossible to decide from the boundaries or their extrapolation
17. cvpaper.challenge 17
■ Context Encoder (CE)
➤ Pretext task: completing images with missing regions (inpainting)
- The paper proposes Adversarial Loss + L2 Loss, but the representation-learning experiments use only the L2 Loss
- In other words, plain regression (see the sketch below)
➤ During representation learning the network only ever sees images with missing regions
- At the target task, however, intact images are fed in
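A minimal sketch of the L2-only version of the pretext task, assuming an encoder-decoder net that maps the masked image to a full-resolution reconstruction (module and argument names are mine):

    import torch.nn.functional as F

    def inpainting_loss(net, images, hole=64):
        """images: Bx3xHxW in [0, 1]. Zero out a central square and regress its content."""
        B, C, H, W = images.shape
        y0, x0 = (H - hole) // 2, (W - hole) // 2
        masked = images.clone()
        masked[:, :, y0:y0 + hole, x0:x0 + hole] = 0.0   # the network never sees this region
        pred = net(masked)                               # same spatial size as the input
        return F.mse_loss(pred[:, :, y0:y0 + hole, x0:x0 + hole],
                          images[:, :, y0:y0 + hole, x0:x0 + hole])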
Reconstruction-based
Cls. Det. Seg.
random 53.3 43.4 19.8
CE 56.5 44.5 29.7
JP 67.7 53.2 —
Figure 2: Context Encoder. The context image is passed through the encoder to obtain features which are connected to the decoder using channel-wise fully-connected layer as …
Pathak et al., “Context Encoders: Feature Learning by Inpainting”, CVPR 2016.
(~ CVPR2017)
18. cvpaper.challenge 18
■ Colorful Image Colorization (CC)
➤ Pretext task: colorizing grayscale images {L => ab}
➤ Rather than plain regression, a classification problem over a quantized ab space is solved
➤ Because representation learning assumes grayscale input, when color images are handled the input becomes Lab and the weights for the ab channels are randomly initialized
■ Split-Brain (SB)
➤ Split the network in two along the channel dimension and ensemble {L => ab, ab => L}
➤ Quantizing and solving a classification problem yields better feature representations than regression (quantization sketch below)
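A sketch of the key design choice shared by CC and SB, colorization as classification over a quantized ab space (a simple uniform 16x16 grid here; the paper uses 313 in-gamut bins with soft encoding and class rebalancing):

    import torch.nn.functional as F

    NUM_BINS = 16                                  # 16 x 16 grid over the ab plane

    def ab_to_class(ab):
        """ab: Bx2xHxW with values roughly in [-110, 110] -> one bin index per pixel."""
        idx = ((ab + 110.0) / 220.0 * NUM_BINS).long().clamp(0, NUM_BINS - 1)
        return idx[:, 0] * NUM_BINS + idx[:, 1]    # BxHxW class map

    def colorization_loss(net, L, ab):
        """net maps the L channel (Bx1xHxW) to per-pixel logits Bx(NUM_BINS**2)xHxW."""
        return F.cross_entropy(net(L), ab_to_class(ab))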
Reconstruction-based
Fig. 2. Our network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm [30] layer. The net has no pool layers. All changes in resolution are achieved through spatial downsampling or upsampling between conv blocks.
…[29]. In Section 3.1, we provide quantitative comparisons to Larsson et al., and encourage interested readers to investigate both concurrent papers.
2 Approach
We train a CNN to map from a grayscale input to a distribution over quantized color value outputs using the architecture shown in Figure 2. Architectural details are described in the supplementary materials on our project webpage, and the model is publicly available. In the following, we focus on the design of the objective function, and our technique for inferring point estimates of color from …
Cls. Det. Seg.
random 53.3 43.4 19.8
CC 65.9 46.9 35.6
SB 67.1 46.7 36.0
JP 67.7 53.2 —
[Figure 2 from the Split-Brain paper: a Lab input image X is split into the grayscale channel (L) and the color channels (ab); each network half predicts the channels it does not see, and the predicted grayscale and color channels are recombined into the predicted image X̂.]
Figure 2: Split-Brain Autoencoders applied to various dom…
Zhang et al., “Colorful Image Colorization”, ECCV 2016.
Zhang et al., “Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction”, CVPR 2017.
(~ CVPR2017)
19. cvpaper.challenge 19
■ DCGAN
➤ Pretext task: learning an image generative model
- Proposes techniques, mainly architectural, that make high-quality generation possible
- Modeling the data distribution well => good features have been captured
➤ Intermediate outputs of the Discriminator are used as the representation (feature-extraction sketch below)
➤ No ImageNet => Pascal VOC experiments
➤ Compared with Exemplar CNN on CIFAR-10
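A sketch of the feature-extraction recipe described in the DCGAN paper (exact details assumed here): tap the discriminator's conv activations with forward hooks, max-pool each to a small grid, and concatenate the flattened results into one vector for a linear classifier.

    import torch
    import torch.nn.functional as F

    def discriminator_features(D, images, conv_layers):
        """D: trained discriminator; conv_layers: the conv modules of D to tap."""
        feats = []
        hooks = [m.register_forward_hook(
                     lambda _m, _inp, out, store=feats: store.append(out))
                 for m in conv_layers]
        with torch.no_grad():
            D(images)
        for h in hooks:
            h.remove()
        pooled = [F.adaptive_max_pool2d(f, 4).flatten(1) for f in feats]  # B x (C*16) each
        return torch.cat(pooled, dim=1)         # representation fed to a linear classifier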
Generative-model-based
On CIFAR-10
Method  acc. (%)  #features
Ex CNN  84.3      1024
DCGAN   82.8      512
Radford et al., “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016.
(~ CVPR2017)
Because the architectures and the datasets used for representation learning differ, this cannot be called a fair comparison.
22. cvpaper.challenge 22
■ BiGAN
➤ Unlike an ordinary GAN, which models only p(x|z) (the Generator), the inference of the latent variables p(z|x) is also modeled (the Encoder)
➤ The joint distribution induced by the Generator, p_G(x, z) = p_G(x|z) p(z), and the joint distribution induced by the Encoder, p_E(x, z) = p_E(z|x) p(x), are pulled together within the same framework as an ordinary GAN (loss sketch below)
➤ Better results than an ordinary GAN that uses D's intermediate outputs as the feature representation
- D is not a general-purpose discriminator between the data distribution and everything else
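A compact sketch of the objective (assuming a generator G, an encoder E and a joint discriminator D(x, z) that returns one logit per pair; not the authors' code):

    import torch
    import torch.nn.functional as F

    def bigan_d_loss(D, E, G, x, z):
        """D tries to tell encoder pairs (x, E(x)) from generator pairs (G(z), z)."""
        real = D(x, E(x).detach())
        fake = D(G(z).detach(), z)
        return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
                F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

    def bigan_eg_loss(D, E, G, x, z):
        """E and G are trained to fool D, which pulls p_G(x, z) and p_E(x, z) together."""
        real = D(x, E(x))
        fake = D(G(z), z)
        return (F.binary_cross_entropy_with_logits(real, torch.zeros_like(real)) +
                F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake)))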
Generative-model-based
Cls. Det. Seg.
random 53.3 43.4 19.8
BiGAN 60.3 46.9 35.2
JP 67.7 53.2 —
Donahue et al., “Adversarial Feature Learning”, ICLR 2017.
(~ CVPR2017)
[Figure 1 from the paper: the generator G maps latent features z to data G(z); the encoder E maps data x to features E(x); the discriminator D receives the pairs (G(z), z) and (x, E(x)) and outputs P(y).]
Figure 1: The structure of Bidirectional Generative Adversarial Networks (BiGAN).
…generator maps latent samples to generated data, but the framework does not include an inverse mapping from data to latent representation. Hence, we propose a novel unsupervised feature learning framework, Bidirectional Generative …
23. cvpaper.challenge 23
■ Learning to Count (LC)
➤ Pretext task: learn features that satisfy the following constraint
➤ Constraint: feed each tile of the split image and the original image into the same CNN; the output feature of the original image must equal the sum of the output features of all tiles
=> the constraint can be satisfied when each dimension of the output feature represents the amount of "some high-level primitive" in the image
➤ Personally I find this a really interesting idea (loss sketch below)
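A sketch of the counting constraint (tile layout, margin value and names are my assumptions): the feature of the downsampled image must equal the sum of the features of its four tiles, while the feature of a different image must not, which rules out the trivial all-zero solution.

    import torch.nn.functional as F

    def split_into_tiles(x):
        """x: Bx3xHxW -> the four Bx3x(H/2)x(W/2) quadrants."""
        B, C, H, W = x.shape
        return [x[:, :, i * H // 2:(i + 1) * H // 2, j * W // 2:(j + 1) * W // 2]
                for i in range(2) for j in range(2)]

    def counting_loss(net, x, x_other, margin=10.0):
        """net outputs a non-negative vector whose entries act as 'counts' of primitives."""
        down = lambda im: F.interpolate(im, scale_factor=0.5, mode='bilinear',
                                        align_corners=False)
        whole = net(down(x))
        tiles_sum = sum(net(t) for t in split_into_tiles(x))
        other = net(down(x_other))
        match = (whole - tiles_sum).pow(2).sum(dim=1)                  # counts must add up
        push = F.relu(margin - (other - tiles_sum).pow(2).sum(dim=1))  # contrastive term
        return (match + push).mean()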
Others
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP 67.7 53.2 —
[Figure 3 from the paper: average magnitude of the responses of the ~1000 feature neurons.]
Figure 3: Average response of our trained network on the ImageNet validation set. Despite its sparsity (30 non-zero entries), the hidden representation in the trained network performs well when transferred to the classification, detection and segmentation tasks.
Method Ref Class. Det.
Supervised [20] [43] 79.9 56.8
Random [33] 53.3 43.4
Context [9] [19] 55.3 46.6
Context [9]∗ [19] 65.3 51.1
Jigsaw [30] [30] 67.6 53.2
ego-motion [1] [1] 52.9 41.8
ego-motion [1]∗ [1] 54.2 43.9
Adversarial [10]∗ [10] 58.6 46.2
ContextEncoder [33] [33] 56.5 44.5
Sound [31] [44] 54.4 44.0
Sound [31]∗ [44] 61.3 -
Video [41] [19] 62.8 47.4
Video [41]∗ [19] 63.1 47.2
Colorization [43]∗ [43] 65.9 46.9
Split-Brain [44]∗ [44] 67.1 46.7
ColorProxy [22] [22] 65.9 -
The feature vector becomes something like a histogram of primitives.
Noroozi et al., “Representation Learning by Learning to Count”, ICCV 2017.
Same author (as Jigsaw Puzzle)
(~ CVPR2018)
24. cvpaper.challenge 24
■ Noise As Targets (NAT)
➤ Pretext task: put the output of each image into one-to-one correspondence with uniformly sampled target vectors and pull them together
- The targets should be assigned so that the total error over all samples is minimized
- Exhaustive search is infeasible, so the assignment is approximated per batch with the Hungarian algorithm (sketch below)
➤ At first glance this seems meaningless, but spreading the image feature vectors uniformly over the feature space apparently is what matters (see the Appendix)
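A sketch of the per-batch update (assumed details: scipy's Hungarian solver, unit-normalized features and a plain L2 loss):

    import torch
    import torch.nn.functional as F
    from scipy.optimize import linear_sum_assignment

    def nat_batch_loss(net, images, targets, batch_idx):
        """targets: fixed N x d matrix of unit-norm vectors sampled uniformly on the sphere;
        batch_idx: the rows of `targets` currently assigned to this batch's images."""
        feats = F.normalize(net(images), dim=1)              # B x d, unit norm
        batch_targets = targets[batch_idx]                   # B x d
        # Re-assign targets within the batch so that the total L2 error is minimal
        cost = torch.cdist(feats, batch_targets).detach().cpu().numpy()
        _, col = linear_sum_assignment(cost)                 # Hungarian algorithm
        assigned = batch_targets[torch.as_tensor(col, device=batch_targets.device)]
        # (the full method also writes `col` back into the global assignment table)
        return 0.5 * (feats - assigned).pow(2).sum(dim=1).mean()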
Others
Cls. Det. Seg.
random 53.3 43.4 19.8
NAT 65.3 49.4 36.6
JP 67.7 53.2 —
Bojanowski et al., “Unsupervised Learning by Predicting Noise”, ICML 2017.
[Figure 1 from the paper: images are mapped by a CNN to features f_θ(X), which are assigned (via the assignment P) one-to-one to fixed target vectors c_j in the target space.]
Figure 1. Our approach takes a set of images, computes their deep …
Choosing the loss function. In the supervised setting, a popular choice for the loss ℓ is the softmax function. However, computing this loss is linear in the number of targets, making it impractical for large output spaces (Goodman, 2001). While there are workarounds to scale these losses to large output spaces, Tygert et al. (2017) has recently shown that using a squared ℓ2 distance works well in many supervised settings, as long as the final activations are unit normalized. This loss only requires access to a single target per sample, making its computation independent of the number of targets. This leads to the following problem:

min_θ min_{Y ∈ ℝ^{n×d}} (1/2n) ‖f_θ(X) − Y‖²_F,   (2)

where we still denote by f_θ(X) the unit normalized features.
One target per data point, sampled from a uniform distribution (and then fixed)
Figure 3. Images and their 3 nearest neighbors in ImageNet according to our model using an ℓ2 distance. The query images are shown on the top row, and the nearest neighbors are sorted from the closer to the further. Our features seem to capture global distinctive structures.
Figure 4. Filters from the first layer of an AlexNet trained on ImageNet with supervision (left) or with NAT (right). The filters are in grayscale, since we use grayscale gradient images as input. This visualization shows the composition of the gradients with the …
4.2. Comparison with the state of the art
We report results on the transfer task both on ImageNet and PASCAL VOC 2007. In both cases, the model is trained on ImageNet.

ImageNet classification. In this experiment, we evaluate the quality of our features for the object classification task of ImageNet. Note that in this setup, we build the unsupervised features on images that correspond to predefined image categories. Even though we do not have access to category labels, the data itself is biased towards these classes. In order to evaluate the features, we freeze the layers up to the last convolutional layer and train the classifier with supervision. This experimental setting follows Noroozi & Favaro (2016).
Nearest Neighbor
(~ CVPR2018)
26. cvpaper.challenge 26
■ Spot Artifact (SA)
➤ Pretext task: completing images that were corrupted in feature-map space
- Adversarial training between repair layers that fill in the corruption and a discriminator
- Uses the feature maps of a model pretrained as an autoencoder
- The discriminator is expected to acquire a good feature representation
➤ Corrupting the feature maps is expected to remove higher-level information (the resulting corrupted images are hard to tell apart by eye)
Reconstruction-based
Cls. Det. Seg.
random 53.3 43.4 19.8
SA 69.8 52.5 38.1
JP 67.7 53.2 —
Figure 2. The proposed architecture. Two autoencoders {E, D1, D2, D3, D4, D5} output either real images (top row) or images with artifacts (bottom row). A discriminator C is trained to distinguish them. The corrupted images are generated by masking the encoded feature φ(x) and then by using a repair network {R1, R2, R3, R4, R5} distributed across the layers of the decoder. The mask is also used by the repair network to change only the dropped entries of the feature (see Figure 5 for more details). The discriminator and the repair network (both shaded in blue) are trained in an adversarial fashion on the real/corrupt classification loss. The discriminator is also trained to output the mask used to drop feature entries, so that it learns to localize all artifacts.
Repair layers are inserted
The corrupted locations are also predicted
Jenni et al., “Self-Supervised Feature Learning by Learning to Spot Artifacts”, CVPR 2018.
Red: corrupt, green: real
(~ CVPR2018)
CVPR2018
27. cvpaper.challenge 27
■ Jigsaw Puzzle++ (JP++)
➤ Pretext task: JP in which 1 to 3 patches are replaced by patches from another image
- Fewer patches are visible, and the patches from the other image must be identified
- Both of these raise the difficulty of the pretext task
- Permutations are chosen with the Hamming distance in mind so that no configuration belongs to multiple classes
Classification-based
Cls. Det. Seg.
random 53.3 43.4 19.8
LC 67.7 51.4 36.6
JP++ 69.8 55.5 38.1
JP 67.7 53.2 —
Figure 3: The Jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation …
Noroozi et al., “Boosting Self-Supervised Learning via Knowledge Transfer”, CVPR 2018.
Same author (as Jigsaw Puzzle)
(~ CVPR2018)
CVPR2018
28. cvpaper.challenge 28
■ Classify Rotation (CR)
➤ Pretext task: estimating the rotation of an image
- 4-class classification over 0°, 90°, 180°, and 270°
- Finer-grained classification would require interpolation after rotation => artifacts appear and become a source of trivial solutions
➤ Estimating an object's rotation angle requires high-level information about the object
➤ The best accuracy so far (Cls., Det.) and the simplest to implement (sketch below)
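The whole pretext task fits in a few lines (a sketch; torch.rot90 rotates without interpolation, so no resampling artifacts are introduced):

    import torch
    import torch.nn.functional as F

    def rotation_loss(net, images):
        """images: Bx3xHxW. net must output 4 logits per image."""
        rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
        labels = torch.arange(4, device=images.device).repeat_interleave(images.size(0))
        return F.cross_entropy(net(rotated), labels)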
Classification-based
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
[Figure 2 from the paper: an input image X is rotated by 0, 90, 180, and 270 degrees via g(X, y); each rotated image is fed to the same ConvNet model F(.), whose objective is to maximize the probability of predicting the correct rotation label y (0, 90, 180, or 270 degrees).]
Figure 2: Illustration of the self-supervised task that we propose for semantic feature learning. Given four possible geometric transformations, the 0, 90, 180, and 270 degrees rotations, we train a ConvNet model F(.) to recognize the rotation that is applied to the image that it gets as input.
Gidaris et al., “Unsupervised Representation Learning by Predicting Image Rotations”, ICLR 2018.
(~ CVPR2018)
29. cvpaper.challenge 29
■ Classify Rotation (CR)
➤ Dependence on the structure of the data
➤ For some image domains, couldn't rotation be estimated from low-level features?
- Indeed, it underperforms on the Places scene-classification task
➤ There must also be images for which rotation is not even well defined
- e.g., aerial imagery
Classification-based
Gidaris et al., “Unsupervised Representation Learning by Predicting Image Rotations”, ICLR 2018.
[The rows below appear to be the paper's corresponding ImageNet linear-classification results, shown for comparison; columns as in the Places table further down.]
Method Conv1 Conv2 Conv3 Conv4 Conv5
Random 11.6 17.1 16.9 16.3 14.1
Random rescaled Krähenbühl et al. (2015) 17.5 23.0 24.5 23.2 20.6
Context (Doersch et al., 2015) 16.2 23.3 30.2 31.7 29.6
Context Encoders (Pathak et al., 2016b) 14.1 20.7 21.0 19.8 15.5
Colorization (Zhang et al., 2016a) 12.5 24.5 30.4 31.5 30.3
Jigsaw Puzzles (Noroozi & Favaro, 2016) 18.2 28.8 34.0 33.9 27.1
BIGAN (Donahue et al., 2016) 17.7 24.5 31.0 29.9 28.0
Split-Brain (Zhang et al., 2016b) 17.7 29.3 35.4 35.2 32.8
Counting (Noroozi et al., 2017) 18.0 30.6 34.3 32.5 25.7
(Ours) RotNet 18.8 31.7 38.7 38.2 36.5
Table 6: Task & Dataset Generalization: Places top-1 classification with linear layers. We compare our unsupervised feature learning approach with other unsupervised approaches by training logistic regression classifiers on top of the feature maps of each layer to perform the 205-way Places classification task (Zhou et al., 2014). All unsupervised methods are pre-trained (in an unsupervised way) on ImageNet. All weights are frozen and feature maps are spatially resized (with adaptive max pooling) so as to have around 9000 elements. All approaches use AlexNet variants and were pre-trained on ImageNet without labels except the Place labels, ImageNet labels, and Random entries.
Method Conv1 Conv2 Conv3 Conv4 Conv5
Places labels Zhou et al. (2014) 22.1 35.1 40.2 43.3 44.6
ImageNet labels 22.7 34.8 38.4 39.4 38.7
Random 15.7 20.3 19.8 19.1 17.5
Random rescaled Krähenbühl et al. (2015) 21.4 26.2 27.1 26.1 24.0
Context (Doersch et al., 2015) 19.7 26.7 31.9 32.7 30.9
Context Encoders (Pathak et al., 2016b) 18.2 23.2 23.4 21.9 18.4
Colorization (Zhang et al., 2016a) 16.0 25.7 29.6 30.3 29.7
Jigsaw Puzzles (Noroozi & Favaro, 2016) 23.0 31.9 35.0 34.2 29.3
BIGAN (Donahue et al., 2016) 22.0 28.7 31.8 31.3 29.7
Split-Brain (Zhang et al., 2016b) 21.3 30.7 34.0 34.1 32.5
Counting (Noroozi et al., 2017) 23.3 33.9 36.3 34.7 29.6
(Ours) RotNet 21.5 31.0 35.1 34.6 33.7
…classification tasks of ImageNet, Places, and PASCAL VOC datasets and on the object detection and object segmentation tasks of PASCAL VOC.
Implementation details: For those experiments we implemented our RotNet model with an AlexNet architecture. Our implementation of the AlexNet model does not have local response normalization units, dropout units, or groups in the convolutional layers while it includes batch …
[Figure from the Places dataset paper (Zhou et al., IEEE TPAMI, DOI 10.1109/TPAMI.2017.2723009): example scene categories such as amusement park, elevator door, arch, corral, windmill, bar, cafeteria, field road, fishpond, watering hole, train station platform, tower, soccer field, swimming pool, staircase, shoe shop, and rainforest, grouped into Indoor / Nature / Urban.]
Places
For example, the rotation can be estimated from the position of the sky alone.
(~ CVPR2018)
33. cvpaper.challenge 33
■ Deep Cluster (DC)
➤ Repeat the following two steps (see the sketch after this list):
① k-means clustering based on the CNN's intermediate features
② learn a classification problem using the assigned clusters as pseudo-labels
➤ In the first iteration, clustering is based on the outputs of a randomly initialized CNN
- Even an MLP trained on those outputs reaches 12%
=> the input information is retained to some extent
➤ In the ImageNet experiments, k = 10000 (> 1000) works best
➤ A simple yet very strong method
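A heavily simplified sketch of one iteration (assumptions: a non-shuffling loader, sklearn's k-means on raw features, and a fresh linear head per epoch; the real implementation clusters PCA-reduced features with faiss):

    import torch
    import torch.nn.functional as F
    from sklearn.cluster import KMeans

    def deepcluster_epoch(backbone, loader, k=10000, feat_dim=4096, device='cuda'):
        # 1) compute features for the whole dataset and cluster them
        backbone.eval()
        with torch.no_grad():
            feats = torch.cat([backbone(x.to(device)).cpu() for x, _ in loader])
        pseudo = torch.as_tensor(KMeans(n_clusters=k, n_init=1).fit_predict(feats.numpy()))
        # 2) train backbone + classifier on the cluster ids used as pseudo-labels
        head = torch.nn.Linear(feat_dim, k).to(device)
        opt = torch.optim.SGD(list(backbone.parameters()) + list(head.parameters()), lr=0.05)
        backbone.train()
        for i, (x, _) in enumerate(loader):                 # loader must not shuffle here
            y = pseudo[i * loader.batch_size:(i + 1) * loader.batch_size].to(device)
            loss = F.cross_entropy(head(backbone(x.to(device))), y)
            opt.zero_grad()
            loss.backward()
            opt.step()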
Recent trends
Caron et al., “Deep Clustering for Unsupervised Learning of Visual Features ”, ECCV 2018.
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
DC 73.7 55.4 45.1
34. cvpaper.challenge 34
■ Deep Cluster (DC)
➤ Repeat the following two steps:
① k-means clustering based on the CNN's intermediate features
② learn a classification problem using the assigned clusters as pseudo-labels
➤ In the first iteration, clustering is based on the outputs of a randomly initialized CNN
- Even an MLP trained on those outputs reaches 12%
=> the input information is retained to some extent
➤ In the ImageNet experiments, k = 10000 (> 1000) works best
➤ A simple yet very strong method
Recent trends
Caron et al., “Deep Clustering for Unsupervised Learning of Visual Features ”, ECCV 2018.
[Figure from the paper: (a) clustering quality, (b) cluster reassignment, (c) influence of k.]
The mutual information between ImageNet labels and the clusters keeps increasing
The mutual information between epochs increases => the cluster assignments stabilize
Cls. Det. Seg.
random 53.3 43.4 19.8
CR 73.0 54.4 39.1
JP++ 69.8 55.5 38.1
DC 73.7 55.4 45.1
35. cvpaper.challenge 35
■ Deep InfoMax (DIM)
➤ Train so as to maximize the mutual information I(x; z) between the input x and the feature vector z
- Put simply, make the dependence between x and z large
- In practice, maximizing the mutual information between z and each patch of x is what delivers the large gains
➤ Simply attaching a discriminator that classifies (x, z) pairs as positive or negative and training end-to-end maximizes a lower bound on I(x; z) (sketch below)
➤ There is no GAN-style alternating optimization, so implementation and training are easy
➤ Not compared against every method, but accuracy is close to supervised learning
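A sketch of the global objective only (JSD-style bound; local_enc, global_enc and disc are assumed modules): the discriminator scores matched (feature map, feature vector) pairs high and mismatched pairs low, and training everything end-to-end maximizes a lower bound on I(x; z).

    import torch
    import torch.nn.functional as F

    def dim_global_loss(local_enc, global_enc, disc, x):
        """local_enc: image -> BxCxMxM map; global_enc: map -> Bxd vector z;
        disc: (map, z) -> B scores. Negatives pair each z with another image's map."""
        fmap = local_enc(x)
        z = global_enc(fmap)
        fmap_neg = torch.roll(fmap, shifts=1, dims=0)   # feature maps of "other" images
        pos, neg = disc(fmap, z), disc(fmap_neg, z)
        # Jensen-Shannon mutual-information estimator used by DIM
        return F.softplus(-pos).mean() + F.softplus(neg).mean()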
Recent trends
Hjelm et al., “Learning deep representations by mutual information estimation and maximization”, arXiv 8/2018.
Figure 1: The base encoder model in the context of image data. An image (in this case) is encoded into a convolutional network until reaching a feature map of M × M feature vectors corresponding to M × M input patches. These vectors are summarized (for instance, using additional convolutions and fully-connected layers) into a single feature vector, Y. Our goal is to train this network such that relevant information about the input is extractable from the high-level features.

Figure 2: Deep INFOMAX (DIM) with a global MI(X; Y) objective. Here, we pass both the high-level feature vector, Y, and the lower-level M × M feature map (see Figure 1) through a discriminator composed of additional convolutions, flattening, and fully-connected layers to get the score. Fake samples are drawn by combining the same feature vector with an M × M feature map from another image.
Table 2: Classification accuracy (top 1) results on Tiny ImageNet and STL. DIM with the local objective outperforms all other models presented … accuracy of a fully-supervised classifier with a similar architecture …

Method | Tiny ImageNet conv | Tiny ImageNet fc (4096) | Tiny ImageNet Y (64) | STL conv
Fully supervised | 36.60 | | |
VAE | 18.63 | 16.88 | 11.93 | 58.27
AAE | 18.04 | 17.27 | 11.49 | 59.54
BiGAN | 24.38 | 20.21 | 13.06 | 71.53
NAT | 13.70 | 11.62 | 1.20 | 64.32
DIM(G) | 11.32 | 6.34 | 4.95 | 42.03
DIM(L) | 33.8 | 34.5 | 30.7 | 71.82
Table 3: Extended comparisons on CIFAR10. Linear classification … runs. MS-SSIM is estimated by training a separate decoder using …
Accuracy close to supervised learning on Tiny ImageNet
36. cvpaper.challenge 36
■ Contrastive Predictive Coding (CPC)
➤ Maximize the mutual information between the feature vector c_t at some time step of a sequence and a future input x_{t+k}
➤ Here the discriminator solves an N-way classification problem, picking the single positive pair out of N pairs, which maximizes a lower bound on the mutual information (InfoNCE sketch below)
➤ For images, the feature map is treated as a top-to-bottom sequence, as in the figure
➤ Not compared against every method, but overwhelmingly the best accuracy within its own experiments
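A sketch of the InfoNCE loss at the heart of CPC (the linear predictor W_k and the use of in-batch negatives are the usual setup; the names are mine): for each context c_t, the prediction must pick out the true future code z_{t+k} among the codes of all samples in the batch.

    import torch
    import torch.nn.functional as F

    def info_nce(c_t, z_future, W_k):
        """c_t: Bxd context vectors; z_future: Bxd true future codes; W_k: nn.Linear(d, d)."""
        pred = W_k(c_t)                                   # B x d predictions
        logits = pred @ z_future.t()                      # B x B scores, positives on the diagonal
        labels = torch.arange(c_t.size(0), device=c_t.device)
        return F.cross_entropy(logits, labels)            # B-way classification = InfoNCE bound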
Recent trends
Oord et al., “Representation Learning with Contrastive Predictive Coding”, arXiv 6/2018.
[Figure 4 from the paper: a 256 px input image is split into 64 px patches with 50% overlap; g_enc outputs a code z for each patch, g_ar summarizes the patches above into the context c_t, and predictions are made for z_{t+2}, z_{t+3}, z_{t+4}.]
Figure 4: Visualization of Contrastive Predictive Coding for images (2D adaptation of Figure 1).
To understand the representations extracted by CPC, we measure the phone prediction performance with a linear classifier trained on top of these features, which shows how linearly separable the …
Method Top-1 ACC
Using AlexNet conv5
Video [27] 29.8
Relative Position [11] 30.4
BiGan [34] 34.8
Colorization [10] 35.2
Jigsaw [28] * 38.1
Using ResNet-V2
Motion Segmentation [35] 27.6
Exemplar [35] 31.5
Relative Position [35] 36.2
Colorization [35] 39.6
CPC 48.7
Table 3: ImageNet top-1 unsupervised classification results. *Jigsaw is not directly comparable to the other AlexNet results because of architectural differences.