5. From NLP to Vision and Language and CV
Recently, applications to CV and Vision and Language have been intensifying
Survey papers have been appearing one after another
2020/12/23
A Survey on Visual Transformer (2020)
https://arxiv.org/abs/2012.12556
2021/01/04
Transformers in Vision: A Survey
https://arxiv.org/abs/2101.01169
2021/03/06
Perspectives and Prospects on Transformer Architecture for
Cross-Modal Tasks with Language and Vision
https://arxiv.org/abs/2103.04037
11. References 1
[H. Zhang+, ICML2019] Zhang, Han, et al. “Self-Attention Generative Adversarial
Networks.” Proceedings of the 36th International Conference on Machine Learning
(2019).
28. Recommended references on positional embeddings
Position Information in Transformers: An Overview [P. Dufter+, arXiv2021]
https://arxiv.org/abs/2102.11090
A comprehensive survey of position embeddings
On Position Embeddings in BERT [B. Wang+, ICLR2021]
https://openreview.net/forum?id=onxoVA9FxMw
Compares the properties of sinusoidal and learned position embeddings under various conditions
Suggests recommended settings for each task:
● Absolute position embeddings work better for classification tasks (perhaps because special tokens can be handled more flexibly)
● Relative position embeddings work better for span prediction tasks
● For classification tasks, performance is better when the symmetry of the position embeddings is broken
What Do Position Embeddings Learn? An Empirical Study of Pre-Trained
Language Model Positional Encoding [Y.A. Wang+, EMNLP2020]
https://arxiv.org/abs/2010.04903
Visualizes the position embeddings of BERT, RoBERTa, GPT-2, and other models
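For reference, the sinusoidal (fixed) type compared in the papers above is the positional encoding introduced in [A. Vaswani+, NIPS2017]; for position pos and embedding dimensions 2i and 2i+1 of a d_model-dimensional model it is
PE_{(pos,\,2i)} = \sin\left( pos / 10000^{2i/d_{model}} \right), \qquad PE_{(pos,\,2i+1)} = \cos\left( pos / 10000^{2i/d_{model}} \right)
whereas the learned type simply treats each position's embedding vector as a trainable parameter.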
39. References
[A. Vaswani+, NIPS2017] Vaswani, Ashish et al. “Attention is All you Need.”
Advances in Neural Information Processing Systems 30 (2017).
[J. Gehring+, ICML2017] Gehring, Jonas et al. “Convolutional Sequence to Sequence
Learning.” Proceedings of the 34th International Conference on Machine Learning
(2017).
[B. Wang+, ICLR2021] Wang, Benyou, et al. “On Position Embeddings in BERT.”
Proceedings of International Conference on Learning Representations (2021).
[M. Neishi+, CoNLL2019] Neishi, Masato and Yoshinaga, Naoki “On the Relation
between Position Information and Sentence Length in Neural Machine Translation.”
Proceedings of the 23rd Conference on Computational Natural Language Learning
(2019).
[P. Dufter+, arXiv2021] Dufter, Philipp et al. “Position Information in Transformers: An
Overview.” arXiv preprint arXiv:2102.11090 (2021).
[G. Ke+, ICLR2021] Ke, Guolin, et al. “Rethinking Positional Encoding in Language
Pre-training.” Proceedings of International Conference on Learning Representations
(2021).
[Y.A. Wang+, EMNLP2020] Wang, Yu-An and Chen, Yun-Nung “What Do Position
Embeddings Learn? An Empirical Study of Pre-Trained Language Model Positional
Encoding.” Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (2020).
40. References
[J. Li+, EMNLP2018] Li, Jian, et al. “Multi-Head Attention with Disagreement
Regularization.” Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing (2018).
[P.Y. Huang+, EMNLP2019] Huang, Po-Yao, et al. “Multi-Head Attention with
Diversity for Learning Grounded Multilingual Multimodal Representations.”
Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language
Processing (2019).
[M. Geva+, arXiv2021] Geva, Mor, et al. “Transformer Feed-Forward Layers Are Key-
Value Memories.” arXiv preprint arXiv:2012.14913 (2020).
52. References
[A. Vaswani+, NIPS2017] Vaswani, Ashish et al. “Attention is All you Need.”
Advances in Neural Information Processing Systems 30 (2017).
[J. Devlin+, NAACL2019] Devlin, Jacob, et al. “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding.” Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
(2019).
[N. Chen+, arXiv2019] Chen, Nanxin, et al. “Listen and Fill in the Missing Letters:
Non-Autoregressive Transformer for Speech Recognition.” arXiv preprint
arXiv:1911.04908 (2019).
[J. Gu+, NeurIPS2019] Gu, Jiatao, et al. “Levenshtein Transformer.” Advances in
Neural Information Processing Systems 32 (2019).
[Y. Liu+, arXiv2019] Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining
approach." arXiv preprint arXiv:1907.11692 (2019).
68. References
[A. Dosovitskiy+, ICLR2021] Dosovitskiy, Alexey, et al. “An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale.” Proceedings of International
Conference on Learning Representations (2021).
[H. Touvron+, arXiv2020] Touvron, Hugo, et al. "Training data-efficient image
transformers & distillation through attention." arXiv preprint arXiv:2012.12877 (2020).
[N. Carion+, ECCV2020] Carion, Nicolas et al. “End-to-End Object Detection with
Transformers.” European Conference on Computer Vision (2020).
[R. Girdhar+, CVPR2019] Girdhar, Rohit et al. “Video Action Transformer Network.”
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (2019).
[G. Bertasius+, arXiv2021] Bertasius, Gedas, et al. “Is Space-Time Attention All You
Need for Video Understanding?” arXiv preprint arXiv:2102.05095 (2021).
[A. Arnab+, arXiv2021] Arnab, Anurag et al. “ViViT: A Video Vision Transformer.”
arXiv preprint arXiv:2103.15691 (2021).
[K. Lin+, CVPR2021] Lin, Kevin, et al. “End-to-End Human Pose and Mesh
Reconstruction with Transformers.” arXiv preprint arXiv:2012.09760 (2021).
[N. Wang+, CVPR2021] Wang, Ning et al. “Transformer Meets Tracker: Exploiting
Temporal Context for Robust Visual Tracking.” arXiv preprint
arXiv:2103.11681(2021).
69. References
[S. Zheng+, CVPR2021] Zheng, Sixiao et al. “Rethinking Semantic Segmentation
from a Sequence-to-Sequence Perspective with Transformers.” arXiv preprint
arXiv:2012.15840 (2021).
[H. Wang+, CVPR2021] Wang, Huiyu, et al. “MaX-DeepLab: End-to-End Panoptic
Segmentation with Mask Transformers.” arXiv preprint arXiv:2012.00759 (2021).
[M. Chen+, CVPR2021] Chen, Mingfei, et al. “Reformulating HOI Detection as Adaptive
Set Prediction.” arXiv preprint arXiv:2103.05983 (2021).
[Q. Wang+, CVPR2021] Wang, Qianqian et al. “IBRNet: Learning Multi-View Image-
Based Rendering.” arXiv preprint arXiv:2102.13090 (2021).
[D.M. Arroyo+, CVPR2021] Arroyo, Diego Martin, Janis Postels, and Federico
Tombari “Variational Transformer Networks for Layout Generation.” arXiv preprint
arXiv:2104.02416 (2021).
[K. Nakashima+, arXiv2021] Nakashima, Kodai et al. “Can Vision Transformers Learn
without Natural Images?” arXiv preprint arXiv:2103.13023 (2021).
84. What should be used for pre-training?
Models
Language
• BERT
• RoBERTa
• XLNet
Vision
• ViT
• ResNet
(still often used only partially, as a feature extractor)
Datasets
Language
• WebText (GPT-X)
• WMT (translation)
• etc.
Vision
• ImageNet
• OpenImages
• JFT300M, IG3.5B
→ not publicly available; only pre-trained models
For language, a separate dataset seems to be needed for each task
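As a concrete starting point, the models listed above are all distributed as pre-trained checkpoints. The snippet below is a minimal sketch assuming the Hugging Face transformers library and its public checkpoint names; the slide itself does not prescribe any particular library.

# Minimal sketch (assumption: Hugging Face `transformers` and its public checkpoints)
from transformers import AutoTokenizer, AutoModel

# Language: BERT (alternatives: "roberta-base", "xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Vision: ViT pre-trained on ImageNet-21k (the JFT-300M weights are not public)
vit = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")

# Contextual embeddings for one sentence
inputs = tokenizer("The dog is walking", return_tensors="pt")
outputs = bert(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, num_tokens, hidden_dim)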
85. What should be used for pre-training?
BERT (Language) [J. Devlin+, NAACL2019]
(https://www.aclweb.org/anthology/N19-1423/)
Pre-training: obtains context-aware embedding representations from the training data without supervision
• Masked Language Model (MLM)
Predicts the words (tokens) that fill the blanks in a sentence with blanks
• Next Sentence Prediction (NSP)
Given two sentences, predicts whether they are consecutive
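To make NSP concrete, here is a minimal sketch of how one NSP training example is built (a hypothetical helper for illustration, not the original BERT preprocessing code); sentence pairs are true consecutive sentences 50% of the time and random pairs 50% of the time:

import random

def make_nsp_example(doc_sentences, all_sentences):
    # Pick sentence A, then decide whether B is its true successor (label 1) or a random sentence (label 0)
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, is_next = doc_sentences[i + 1], 1      # 50%: actual next sentence
    else:
        sent_b, is_next = random.choice(all_sentences), 0  # 50%: random sentence from the corpus
    # BERT input format: [CLS] A [SEP] B [SEP]
    return "[CLS] " + sent_a + " [SEP] " + sent_b + " [SEP]", is_next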
86. What should be used for pre-training?
BERT's Masked Language Model
15% of the input tokens (words or subwords) are selected as replacement candidates. Of these,
• 80%: replaced with the [Mask] token
– The [Mask] is walking
• 10%: replaced with another random word
– The car is walking
• 10%: kept unchanged
– The dog is walking
*Whole Word Masking: accuracy improves by masking whole words rather than masking only parts of words split at subword boundaries
Original word: philammon (phil ##am ##mon)
Conventional: phil [Mask] ##mon → Whole Word Masking: [Mask] [Mask] [Mask]
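A minimal sketch of the 80/10/10 rule above (illustration only; real implementations operate on token IDs and also record which positions have to be predicted):

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    # Select ~15% of tokens as replacement candidates, then apply the 80/10/10 rule
    masked = list(tokens)
    for i in range(len(masked)):
        if random.random() < select_prob:
            r = random.random()
            if r < 0.8:
                masked[i] = "[Mask]"              # 80%: replace with the mask token
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random word
            # remaining 10%: keep the original token
    return masked

print(mask_tokens("The dog is walking".split(), vocab=["car", "cat", "run"]))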
87. What should be used for pre-training?
RoBERTa (Language) [Y. Liu+, arXiv2019]
(https://arxiv.org/abs/1907.11692)
A model that improves on BERT by using more pre-training data
Pre-training
• Masked Language Model (MLM)
Unlike BERT's static masking, generates a different (dynamic) masking for each epoch during training (see the sketch below)
• Shows that Next Sentence Prediction (NSP) has almost no effect, and pre-trains with MLM only
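Reusing the hypothetical mask_tokens sketch above, the difference between static and dynamic masking can be illustrated as follows; corpus, vocab and train_one_epoch are placeholders for illustration, not RoBERTa's actual training code.

corpus = ["The dog is walking".split(), "The cat sat down".split()]  # placeholder corpus
vocab = ["car", "cat", "run"]                                        # placeholder vocabulary

def train_one_epoch(batches):  # placeholder for one epoch of MLM training
    pass

# Static masking (BERT): each sequence is masked once during preprocessing
static_data = [mask_tokens(seq, vocab) for seq in corpus]
for epoch in range(3):
    train_one_epoch(static_data)

# Dynamic masking (RoBERTa): a fresh mask is sampled every time a sequence is used
for epoch in range(3):
    train_one_epoch([mask_tokens(seq, vocab) for seq in corpus])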
122. References
[J. Clark+, arXiv2021] Clark, Jonathan H., et al. "CANINE: Pre-training an Efficient
Tokenization-Free Encoder for Language Representation." arXiv preprint
arXiv:2103.06874 (2021).
[J. Lee+, Bioinformatics2020] Lee, Jinhyuk, et al. "BioBERT: a pre-trained biomedical
language representation model for biomedical text mining." Bioinformatics 36.4
(2020): 1234-1240.
[A. Cohan+, ACL2020] Cohan, Arman, et al. "Specter: Document-level representation
learning using citation-informed transformers." Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics. 2020.
[A. Dosovitskiy+, ICLR2021] Dosovitskiy, Alexey, et al. “An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale.” Proceedings of International
Conference on Learning Representations (2021).
[D.Zhang+, ECCV2020] Zhang, Dong, et al. "Feature pyramid transformer."
European Conference on Computer Vision. Springer, Cham, 2020.
[F. Yang+, CVPR2020] Yang, Fuzhi, et al. "Learning texture transformer network for
image super-resolution." Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2020.
[J. Chen+, arXiv2021] Chen, Jieneng, et al. "TransUNet: Transformers make
strong encoders for medical image segmentation." arXiv preprint arXiv:2102.04306
(2021).
123. References
[K. Han+, arXiv2020] Han, Kai, et al. "A Survey on Visual Transformer." arXiv preprint
arXiv:2012.12556 (2020).
[K. Han+, arXiv2021] Han, Kai, et al. "Transformer in transformer." arXiv preprint
arXiv:2103.00112 (2021).
[S. Khan+ arXiv2021] Khan, Salman, et al. "Transformers in Vision: A Survey." arXiv
preprint arXiv:2101.01169 (2021).
[H. Touvron+, arXiv2020] Touvron, Hugo, et al. "Training data-efficient image
transformers & distillation through attention." arXiv preprint arXiv:2012.12877 (2020).
[J. Devlin+, NAACL2019] Devlin, Jacob, et al. "BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding." Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
(2019).
[Y. Liu+, arXiv2019] Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining
approach." arXiv preprint arXiv:1907.11692 (2019)
124. References
[Z. Yang+, NeurIPS2019] Yang, Zhilin, et al. "XLNet: Generalized Autoregressive
Pretraining for Language Understanding." Advances in Neural Information
Processing Systems 32 (2019): 5753-5763.
[Y. Chen+, ECCV2020] Chen, Yen-Chun, et al. "Uniter: Universal image-text
representation learning." European Conference on Computer Vision. Springer,
Cham, 2020.
[J. Lu+, arXiv2019] Lu, Jiasen, et al. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." arXiv preprint arXiv:1908.02265
(2019).
[Z. Gan+, NeurIPS2020] Gan, Zhe, et al. "Large-scale adversarial training for vision-
and-language representation learning." arXiv preprint arXiv:2006.06195 (2020).
[P. Michel+, arXiv2019] Michel, Paul, Omer Levy, and Graham Neubig. "Are sixteen
heads really better than one?." arXiv preprint arXiv:1905.10650 (2019).
[E. Voita+, ACL2019] Voita, Elena, et al. "Analyzing Multi-Head Self-Attention:
Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned." Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics. 2019.
[K. Clark+, ACLworkshop2019] Clark, Kevin, et al. "What Does BERT Look at? An
Analysis of BERT’s Attention." Proceedings of the 2019 ACL Workshop
BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. (2019).
125. References
[D. Zhou+, arXiv2021] Zhou, Daquan, et al. "DeepViT: Towards Deeper Vision
Transformer." arXiv preprint arXiv:2103.11886 (2021).
[J. Li+, EMNLP2018] Li, Jian, et al. "Multi-Head Attention with Disagreement
Regularization." Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing. (2018).
[P. Huang+, EMNLP-IJCNLP2019] Huang, Po-Yao, Xiaojun Chang, and Alexander
G. Hauptmann. "Multi-Head Attention with Diversity for Learning Grounded
Multilingual Multimodal Representations." Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP). (2019).
[Z. Dai+, ACL2019] Dai, Zihang, et al. "Transformer-XL: Attentive Language Models
beyond a Fixed-Length Context." Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics. (2019).
[J. Rae+, ICLR2019] Rae, Jack W., et al. "Compressive Transformers for Long-
Range Sequence Modelling." International Conference on Learning Representations.
(2019).
[R. Child+, arXiv2019] Child, Rewon, et al. "Generating long sequences with sparse
transformers." arXiv preprint arXiv:1904.10509 (2019).
126. References
[I. Beltagy+, arXiv2020] Beltagy, Iz, Matthew E. Peters, and Arman Cohan.
"Longformer: The long-document transformer." arXiv preprint arXiv:2004.05150
(2020).
[M. Zaheer+, arXiv2020] Zaheer, Manzil, et al. "Big bird: Transformers for longer
sequences." arXiv preprint arXiv:2007.14062 (2020).
[N. Kitaev+, ICLR2020] Kitaev, Nikita, Lukasz Kaiser, and Anselm Levskaya.
"Reformer: The Efficient Transformer." International Conference on Learning
Representations. (2020).
[J. Ainslie+, EMNLP2020] Ainslie, Joshua, et al. "ETC: Encoding Long and Structured
Inputs in Transformers." Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP). (2020).
[A. Roy+, TACL2021] Roy, Aurko, et al. "Efficient content-based sparse attention with
routing transformers." Transactions of the Association for Computational Linguistics 9
(2021): 53-68.
[H. Zhou+, AAAI2021] Zhou, Haoyi, et al. "Informer: Beyond Efficient Transformer for
Long Sequence Time-Series Forecasting." arXiv preprint arXiv:2012.07436 (2020).
[S. Li+, NeurIPS2019] Li, Shiyang, et al. "Enhancing the Locality and Breaking the
Memory Bottleneck of Transformer on Time Series Forecasting." Advances in Neural
Information Processing Systems 32 (2019): 5243-5253
127. References
[S. Wang+, arXiv2020] Wang, Sinong, et al. "Linformer: Self-attention with linear
complexity." arXiv preprint arXiv:2006.04768 (2020).
[V. Nguyen+, ECCV2020] Nguyen, Van-Quang, Masanori Suganuma, and Takayuki
Okatani. "Efficient Attention Mechanism for Visual Dialog that Can Handle All the
Interactions Between Multiple Inputs." European Conference on Computer Vision.
(2020): 223-240.
[S. Wu+, NeurIPS2020] Wu, Sifan, et al. "Adversarial Sparse Transformer for Time
Series Forecasting." Advances in Neural Information Processing Systems 33 (2020).
[B. Peters+, ACL2019] Peters, Ben, Vlad Niculae, and André FT Martins. "Sparse
Sequence-to-Sequence Models." Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics. (2019).
[A. Baevski+, ICLR2019] Baevski, Alexei, and Michael Auli. "Adaptive Input
Representations for Neural Language Modeling." International Conference on
Learning Representations. (2019).
[Q. Wang+, ACL2019] Wang, Qiang, et al. "Learning Deep Transformer Models for
Machine Translation." Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. (2019).
128. References
[T. Nguyen+, arXiv2019] Nguyen, Toan Q., and Julian Salazar. "Transformers without
tears: Improving the normalization of self-attention." arXiv preprint arXiv:1910.05895
(2019).
[R. Xiong+, ICML2020] Xiong, Ruibin, et al. "On layer normalization in the
transformer architecture." International Conference on Machine Learning. PMLR,
(2020).
[L. Liu+, EMNLP2020] Liu, Liyuan, et al. "Understanding the Difficulty of Training
Transformers." Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP). (2020).
[M. Popel+, 2018] Popel, Martin, and Ondřej Bojar. "Training Tips for the Transformer
Model." The Prague Bulletin of Mathematical Linguistics 110 (2018): 43-70.
[S. Narang+, arXiv2021] Narang, Sharan, et al. "Do Transformer Modifications
Transfer Across Implementations and Applications?." arXiv preprint
arXiv:2102.11972 (2021).
[Y. You+, ICLR2020] You, Yang, et al. "Large Batch Optimization for Deep Learning:
Training BERT in 76 minutes." International Conference on Learning
Representations. (2020).
[S. Merity, arXiv2019] Merity, Stephen. "Single headed attention rnn: Stop thinking
with your head." arXiv preprint arXiv:1911.11423 (2019).
129. References
[I. Bello, ICLR2021] Bello, Irwan. "LambdaNetworks: Modeling long-range
Interactions without Attention." International Conference on Learning
Representations. 2021.
[A. Jaegle, arXiv2021] Jaegle, Andrew, et al. "Perceiver: General Perception with
Iterative Attention." arXiv preprint arXiv:2103.03206 (2021).
[K. Lu, arXiv2021] Lu, Kevin, et al. "Pretrained transformers as universal computation
engines." arXiv preprint arXiv:2103.05247 (2021).
138. References
[P. Anderson+ CVPR2018] Anderson, Peter, et al. "Bottom-up and top-down
attention for image captioning and visual question answering." Proceedings of the
IEEE conference on computer vision and pattern recognition. (2018).
[W. Su+ ICLR2020] Su, Weijie, et al. "VL-BERT: Pre-training of Generic Visual-
Linguistic Representations." International Conference on Learning Representations.
(2020).
[Y.C. Chen+ ECCV2020] Chen, Yen-Chun, et al. "Uniter: Universal image-text
representation learning." European Conference on Computer Vision. Springer,
Cham, (2020).
[H. Tan+ EMNLP-IJCNLP2019] Tan, Hao, and Mohit Bansal. "LXMERT: Learning
Cross-Modality Encoder Representations from Transformers." Proceedings of the
2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
(2019).
[J. Lu+, arXiv2019] Lu, Jiasen, et al. "Vilbert: Pretraining task-agnostic visiolinguistic
representations for vision-and-language tasks." arXiv preprint arXiv:1908.02265
(2019).
139. References
[J. Lu+ CVPR2020] Lu, Jiasen, et al. "12-in-1: Multi-task vision and language
representation learning." Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. (2020).
[A. Majumdar+ ECCV2020] Majumdar, Arjun, et al. "Improving vision-and-language
navigation with image-text pairs from the web." European Conference on Computer
Vision. Springer, Cham, (2020).
146. Image captioning: After Transformer
Image transformer [S. He+, ACCV2020]
(https://arxiv.org/abs/2004.14231)
• To explicitly incorporate relations between regions, introduces three gates (child, parent, neighborhood) inside the Transformer that adjust the weights according to how much the image regions overlap one another
147. Image captioning: After Transformer
DLCT [Y. Luo+, arXiv2021] (https://arxiv.org/abs/2101.06462)
• Combines the strengths of two kinds of input: image-region features and conventional grid features
• Models the complex visual and positional relations among the input features by integrating both absolute and relative position information
• Uses cross-attention fusion to guide the alignment between the two kinds of features
148. Image captioning: After Transformer
Linking images and text
MIA module [F. Liu+, NeurIPS2019]
(https://arxiv.org/abs/1905.06139)
Proposes a module that stacks mutual attention over the image and the object label information given as input, obtaining an alignment between image region information and label information
Concept-guided Attention [J. Li+, Applied Science 2019]
(https://www.mdpi.com/2076-3417/9/16/3260)
Proposes a module that multiplies attention over image regions with attention over object labels
149. Image captioning: After Transformer
Linking images and text
EnTangled Attention [G. Li+, ICCV2019]
(https://openaccess.thecvf.com/content_ICCV_2019/html/Li_Entangled_Transformer_for_Image_Captioning_ICCV_2019_paper.html)
Proposes a method that mutually combines the image features and the features of the objects' attributes in the image, in a manner similar to co-attention
150. Image captioning: After Transformer
For blind users
Assessing Image Quality Issues for Real-
World Problems [TY. Chiu+, CVPR2020]
(https://arxiv.org/abs/2003.12511)
• Investigates what challenges arise when descriptive captions are added to photos taken by blind users
– About 15% of the submitted images are unrecognizable
– e.g., images suffer from blur or poor framing
151. Image captioning: After Transformer
For blind users
Captioning Images Taken by People Who
Are Blind [D. Gurari+, ECCV2020]
(https://link.springer.com/chapter/10.1007/978-3-030-58520-4_25)
• Releases VizWiz, a dataset for blind users
• Shows that the images in the dataset are diverse
152. Image captioning: After Transformer
Introducing a causal-inference framework
Deconfounded image captioning
[X. Yang+, arXiv2020] (https://arxiv.org/abs/2003.03923)
• Proposes a framework for investigating the causal relationships behind bias in image captioning; the investigation reveals that causal relationships exist in the dataset used for pre-training
153. Image captioning: After Transformer
Rethinking evaluation metrics
Novel metric based on BERTScore [Y. Yi+,
ACL2020] (https://www.aclweb.org/anthology/2020.acl-main.93/)
• Proposes a new evaluation metric for image captioning
• Existing metrics do not take into account the variance among the ground-truth captions, so they tend to penalize mismatches between a generated caption and the reference sentences too heavily
• Proposes a new method based on BERTScore
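For context, plain BERTScore (the base that the proposed metric builds on) can be computed with the bert-score pip package; this is only an assumption about tooling for illustration, not the paper's own implementation.

from bert_score import score

candidates = ["a dog walking down the street"]                 # generated caption
references = [["a dog is walking down the road",
               "a brown dog walks along a street"]]            # multiple reference captions

# Returns precision, recall, and F1 tensors based on BERT token embeddings
P, R, F1 = score(candidates, references, lang="en")
print(F1.item())  # similarity between the generated caption and its references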
164. Vision-and-Language Navigation (VLN)
From around 2020, a shift toward Transformers, starting with VLN BERT?
Before Transformer
• Embodied Question Answering
(EQA) [E. Wijmans+, CVPR2019]
(https://arxiv.org/abs/1904.03461)
– Team of Prof. D. Batra @ Georgia Tech
– An LSTM that takes images as input and outputs forward/backward/left/right actions
After Transformer
• VLN BERT [A. Majumdar+, ECCV2020]
– The same Prof. D. Batra @ Georgia Tech team as EQA
– Builds on the same team's ViLBERT [NeurIPS2019]
• Recurrent VLN BERT [Y. Hong+, CVPR2021]
Excerpt: Fig. 4 of the original EQA paper
Source: Fig. 3 of the original VLN BERT paper
165. Referring Expression Comprehension
Dramatic performance gains from pre-trained V&L models
Before Transformer
• Roughly two types of methods:
– Two-step: (1) generate candidates → (2) rank them
– One-step: extract candidate regions directly from the image
After Transformer
• Experiments have been run with many large-scale pre-trained models, and all of them greatly outperform pre-Transformer models
– ViLBERT, VLBERT, UNITER, LXMERT, VILLA, etc.
• However, all of these are two-step methods, i.e., the Transformer is used only in step (2) (see the sketch below)
• There is (probably) no Transformer model specialized for this task yet
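The two-step pipeline described above can be sketched as follows; detect_regions and vl_model_score are hypothetical placeholders standing in for an object detector and a pre-trained V&L model (e.g., UNITER or ViLBERT), not any specific paper's API.

def locate_referent(image, expression, detect_regions, vl_model_score):
    # Step (1): candidate region generation, e.g., with an off-the-shelf detector
    candidates = detect_regions(image)
    # Step (2): rank the candidates with a pre-trained V&L model
    scores = [vl_model_score(image, region, expression) for region in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]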
166. Referring Expression Comprehension
Research on improving robustness
Words aren't enough, their order matters: On the
Robustness of Grounding Visual Referring
Expressions [A. Akula+, ACL2020]
(https://www.aclweb.org/anthology/2020.acl-main.586/)
• A paper that investigates biases in referring expression grounding
• Also proposes training methods to make ViLBERT more robust:
– Metric learning
– Multi-task learning
167. References
[X. Yang+, arXiv2020] Yang, Xu, Hanwang Zhang, and Jianfei Cai. "Deconfounded image
captioning: A causal retrospect." arXiv preprint arXiv:2003.03923 (2020).
[P. Anderson+, CVPR2018] Anderson, Peter, et al. "Bottom-up and top-down
attention for image captioning and visual question answering." Proceedings of the
IEEE conference on computer vision and pattern recognition. (2018).
[L. Huang+, ICCV2019] Huang, Lun, et al. "Attention on attention for image
captioning." Proceedings of the IEEE/CVF International Conference on Computer
Vision. (2019).
[M. Cornia+, CVPR2020] Cornia, Marcella, et al. "Meshed-memory transformer for
image captioning." Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. (2020).
[S. He+, ACCV2020] He, Sen, et al. "Image captioning through image transformer."
Proceedings of the Asian Conference on Computer Vision. (2020).
[Y. Luo+, arXiv2021] Luo, Yunpeng, et al. "Dual-Level Collaborative Transformer for
Image Captioning." arXiv preprint arXiv:2101.06462 (2021).
[F. Liu+, NeurIPS2019] Liu, Fenglin, et al. "Aligning Visual Regions and Textual
Concepts for Semantic-Grounded Image Representations." NeurIPS. (2019).
[J. Li+, Applied Science2019] Li, Jiangyun, et al. "Boosted transformer for image
captioning." Applied Sciences 9.16 (2019): 3260.
168. References
[G. Li+, ICCV2019] Li, Guang, et al. "Entangled transformer for image captioning."
Proceedings of the IEEE/CVF International Conference on Computer Vision. (2019).
[TY. Chiu+, CVPR2020] Chiu, Tai-Yin, Yinan Zhao, and Danna Gurari. "Assessing
image quality issues for real-world problems." Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. (2020).
[D. Gurari+, ECCV2020] Gurari, Danna, et al. "Captioning images taken by people
who are blind." European Conference on Computer Vision. Springer, Cham, (2020).
[Y. Yi+, ACL2020] Yi, Yanzhi, Hangyu Deng, and Jinglu Hu. "Improving image
captioning evaluation by considering inter references variance." Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics. (2020).
[P. Zhang+, CVPR2021] Zhang, Pengchuan, et al. "VinVL: Revisiting Visual
Representations in Vision-Language Models." arXiv preprint arXiv:2101.00529
(2021).
[X. Li+, ECCV2020] Li, Xiujun, et al. "Oscar: Object-semantics aligned pre-training for
vision-language tasks." European Conference on Computer Vision. Springer, Cham,
(2020).
[X. Hu+, arXiv2020] Hu, Xiaowei, et al. "Vivo: Surpassing human performance in
novel object captioning with visual vocabulary pre-training." arXiv preprint
arXiv:2009.13682 (2020).
169. References
[R. Rombach+, NeurIPS2020] Rombach, Robin, Patrick Esser, and Bjorn Ommer.
"Network-to-Network Translation with Conditional Invertible Neural Networks."
Advances in Neural Information Processing Systems 33 (2020).
[A. Radford+, arXiv2021] Radford, Alec, et al. "Learning transferable visual models
from natural language supervision." arXiv preprint arXiv:2103.00020 (2021).
[D. Bau+, arXiv2021] Bau, David, et al. "Paint by Word." arXiv preprint
arXiv:2103.10951 (2021).
[F. Galatolo+, arXiv2021] Galatolo, Federico A., Mario GCA Cimino, and Gigliola
Vaglini. "Generating images from caption and vice versa via CLIP-Guided Generative
Latent Space Search." arXiv preprint arXiv:2102.01645 (2021).
[A. Ramesh+, arXiv2021] Ramesh, Aditya, et al. "Zero-shot text-to-image
generation." arXiv preprint arXiv:2102.12092 (2021).
[V. Murahari+, ECCV2020] Murahari, Vishvak, et al. "Large-scale pretraining for
visual dialog: A simple state-of-the-art baseline." European Conference on Computer
Vision. Springer, Cham, (2020).
[K. Shuster+, ACL2020] Shuster, Kurt, et al. "The Dialogue Dodecathlon: Open-
Domain Knowledge and Image Grounded Conversational Agents." Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics. (2020).
170. References
[M. Cogswell+, NeurIPS2020] Cogswell, Michael, et al. "Dialog without Dialog Data:
Learning Visual Dialog Agents from VQA Data." Advances in Neural Information
Processing Systems 33 (2020).
[Y. Zhang+, ACL2020] Zhang, Yizhe, et al. "DIALOGPT: Large-Scale Generative
Pre-training for Conversational Response Generation." Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics: System
Demonstrations.(2020).
[V. Nguyen+, ECCV2020] Nguyen, Van-Quang, Masanori Suganuma, and
Takayuki Okatani. "Efficient Attention Mechanism for Visual Dialog that can Handle
All the Interactions between Multiple Inputs." European Conference on Computer Vision. (2020).
[E. Wijmans+, CVPR2019] Wijmans, Erik, et al. "Embodied question answering in
photorealistic environments with point cloud perception." Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2019).
[A. Majumdar+, ECCV2020] Majumdar, Arjun, et al. "Improving vision-and-language
navigation with image-text pairs from the web." European Conference on Computer
Vision. Springer, Cham, (2020).
[Y. Hong+, CVPR2021] Hong, Yicong, et al. "A Recurrent Vision-and-Language
BERT for Navigation." arXiv preprint arXiv:2011.13922 (2020).
171. References
[A. Akula+, ACL2020] Akula, Arjun, et al. "Words Aren’t Enough, Their Order Matters:
On the Robustness of Grounding Visual Referring Expressions." Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics. (2020).