Paper Commentary: Attention Is All You Need (Transformer)
Hi, this is Ryobot.
This paper proposes the Transformer, a neural machine translation model that uses only attention, with no RNNs or CNNs.
With remarkably little training it achieves an overwhelming state of the art, gracefully living up to its title.
The paper also generalizes attention into a very simple formula and then classifies it into additive attention, dot-product attention, source-target attention, and self-attention. Of these, self-attention is a particularly general and powerful technique that can be carried over to almost any other neural network.
On WMT'14 it takes first place with BLEU scores of 41.0 on English-French and 28.4 on English-German.
- Attention Is All You Need [Łukasz Kaiser et al., arXiv, 2017/06]
- Transformer: A Novel Neural Network Architecture for Language Understanding [Project Page]
- TensorFlow (by the authors)
- Chainer
- PyTorch
The left side is the encoder and the right side is the decoder. Each stacks six of the gray blocks (N = 6).
- Encoder: a stack of 6 blocks of [self-attention, position-wise FFN]
- Decoder: a stack of 6 blocks of [(masked) self-attention, source-target attention, position-wise FFN]
Inside each block, residual connections (Residual Connection) and layer normalization (Layer Normalization) are applied.
Before going into the details of the Transformer, let's rethink what attention is.
Attention is a dictionary object
The attention of a standard Encoder-Decoder is expressed by the following equation, where Source denotes the encoder's hidden states and Target the decoder's hidden states:
$$\mathrm{Attention}(\mathrm{Target},\ \mathrm{Source}) = \mathrm{Softmax}(\mathrm{Target} \cdot \mathrm{Source}^\top) \cdot \mathrm{Source}$$
Generalizing further, Target is regarded as the query $Q$ (a search query), and Source (the Memory) is split into a key $K$ and a value $V$:
$$\mathrm{Attention}(Q,\ \mathrm{Memory}) = \mathrm{Attention}(Q,\ K,\ V) = \mathrm{Softmax}(Q K^\top)\, V$$
From here on, a lowercase initial (e.g. $q$) denotes a vector and an uppercase initial (e.g. $Q$) denotes a matrix (an array of vectors).
$K$ and $V$ then act as an array of key-value pairs in which each $k$ corresponds one-to-one to a $v$; in other words, they function as a dictionary object.
The dot product of $q$ and $K$ measures the similarity between $q$ and each $k$, and the attention weights (Attention Weight) obtained by normalizing it with a softmax express the positions of the keys that match $q$. The dot product of the attention weights and $V$ is the operation of extracting, as a weighted sum, the values $v$ at the positions of those keys.
In other words, attention is the operation of looking up the keys $k$ that match the query $q$ (the search query) and retrieving the corresponding values $v$, which is exactly what a dictionary object does. For example, the attention of a standard Encoder-Decoder retrieves, from all of the encoder's hidden states (the information source), the hidden states (the information) related to $q$ as a weighted sum with the attention weights.
If an array of queries $Q$ is given, that many values $v$ are retrieved from the array of key-value pairs.
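To make the dictionary analogy concrete, here is a minimal NumPy sketch of this lookup (my own illustration, not code from the paper; the array sizes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# A "dictionary" of 5 key-value pairs, each a 4-dimensional vector.
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 4))  # keys
V = rng.normal(size=(5, 4))  # values
q = rng.normal(size=(4,))    # a single query vector

weights = softmax(q @ K.T)   # similarity of q to every key, normalized
output = weights @ V         # weighted sum of the matching values
print(weights.shape, output.shape)  # (5,) (4,)
```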
The meaning of splitting Memory into Key and Value
The first appearance of an array of key-value pairs is in the End-To-End Memory Network [Sukhbaatar, 2015], but there the keys are called the input memory and the values the output memory (together, the Memory), and they were not yet recognized as a dictionary object.
The first work to recognize it as a dictionary object is Key-Value Memory Networks [Miller, 2016].
- End-To-End Memory Networks [Sainbayar Sukhbaatar, sec-last: Jason Weston, arXiv, NIPS, 2015/03]
- Key-Value Memory Networks for Directly Reading Documents [Alexander Miller, last: Jason Weston, arXiv, 2016/06]
The Key-Value Memory Networks paper explains that key-value pairs are a general way to store context (e.g. a knowledge base or documents) as memory. By separating the Memory into $K$ and $V$, a non-trivial transformation between $k$ and $v$ yields higher expressive power. "Non-trivial transformation" here means a transformation complex (unpredictable) enough that a learner taking $k$ as input and predicting $v$ cannot easily be built.
Later, a method with the same view was also proposed for language modeling [Daniluk, 2017].
- Frustratingly Short Attention Spans in Neural Language Modeling [Michał Daniluk, arXiv, ICLR, 2017/02]
Additive attention and dot-product attention
Attention is divided into additive attention and dot-product attention according to how the attention weights are computed.
Additive attention (Additive Attention) [Bahdanau, 2014] computes the attention weights with a feed-forward network that has a single hidden layer.
Dot-product attention (Dot-Product Attention, Multiplicative Attention) [Luong, 2015] computes the attention weights with a dot product. In general, dot-product attention needs no extra parameters (and is therefore more memory-efficient) and is faster; the Transformer uses this one. A small sketch comparing the two follows the references below.
- Neural Machine Translation by Jointly Learning to Align and Translate [Dzmitry Bahdanau, sec: Kyunghyun Cho, last: Yoshua Bengio, ICLR 2015, arXiv, 2014/09]
- Effective Approaches to Attention-based Neural Machine Translation [Minh-Thang Luong, arXiv, 2015/08]
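Here is a minimal sketch of the two scoring styles (my own illustration; the weights `W1`, `W2`, `v` stand in for the learned parameters of the additive scorer and are not taken from either paper):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, d))  # one query vector and one key vector

# Additive (Bahdanau-style): a one-hidden-layer feed-forward scorer.
W1, W2 = rng.normal(size=(2, d, d))
v = rng.normal(size=(d,))
additive_score = v @ np.tanh(W1 @ q + W2 @ k)

# Dot-product (Luong-style): no extra parameters, just an inner product.
dot_score = q @ k

print(additive_score, dot_score)
```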
Source-target attention and self-attention
Attention is divided into source-target attention and self-attention according to where the input comes from.
In source-target attention (Source-Target-Attention), the key $K$ and value $V$ come from the encoder's hidden states (Source) and the query $Q$ comes from the decoder's hidden states (Target). The attention of a standard Encoder-Decoder is this kind. As mentioned earlier, the Source is also called the Memory, and $K$ and $V$ can be interpreted as that Memory split in two.
In self-attention (Self-Attention), $Q$, $K$, and $V$ all come from the same place (Self). For example, in the encoder they all come from the layer below.
Self-attention can refer to every position of the layer below when computing the output at a given position. This is an advantage over a convolutional layer, which can only refer to local positions.
In conventional attention models, only one query $q$ is given per attention step (e.g. RNNsearch, MemN2N). But when all of the decoder's queries are given at once, or in self-attention, where $Q$, $K$, and $V$ are given at once, attention is executed for all of them simultaneously and we obtain as many output vectors as there are queries.
Transformer
The model is strikingly simple.
- Encoder: a stack of 6 blocks of [self-attention, position-wise FFN]
- Decoder: a stack of 6 blocks of [(masked) self-attention, source-target attention, position-wise FFN]
Feature representations inside the network are matrices of shape [sequence length x dimension of each word]. Except in the attention layers, each word is processed independently, like the individual samples in a batch.
At training time the decoder does not use autoregression: all target words are fed in at once and all target words are predicted at once. However, the self-attention is masked so that information about a target word to be predicted does not leak into the decoder before it is predicted (i.e. a masked decoder); a small sketch of this mask follows below. At evaluation/inference time, the word sequence is generated autoregressively.
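Here is a minimal NumPy sketch of that mask (my own illustration, not the reference implementation): scores for positions that have not been predicted yet are set to minus infinity before the softmax, so their attention weights become zero.

```python
import numpy as np

seq_len = 5
rng = np.random.default_rng(0)
# scores[i, j] = raw attention score from output position i to input position j
scores = rng.normal(size=(seq_len, seq_len))

# Mask everything above the diagonal: position i may only attend to positions j <= i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # each row is zero above the diagonal and sums to 1
```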
Scaled dot-product attention
In the Transformer, dot-product attention is called scaled dot-product attention (Scaled Dot-Product Attention). Just like ordinary dot-product attention, it uses $Q$ to retrieve the values $V$ from the array of key-value pairs as a weighted sum, but the dot product of $Q$ and $K$ is divided by the scaling factor $\sqrt{d_k}$.
Also, the array of queries is packed into a single matrix $Q$ so that dot-product attention is computed for all of them at once (as before, the arrays of keys and values are packed into $K$ and $V$).
Scaled dot-product attention is expressed by the following equation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
When $d_k$ is small, dot-product attention works as well as additive attention even without the scaling factor. But when $d_k$ is large, additive attention works better unless the scaling factor is used. The reason is that the dot products grow too large and the gradient of the softmax becomes extremely small during backpropagation.
(Figure: Scaled Dot-Product Attention (left) and Multi-Head Attention (right))
Mask (option) is a mask applied to the self-attention so that information about the target words the decoder should predict does not leak into the decoder before prediction (the parts of the softmax input that correspond to positions not yet predicted in the autoregression are filled with $-\infty$).
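Putting the formula and the mask together, a minimal NumPy sketch of scaled dot-product attention might look like this (my own illustration; the shapes and names are not from the authors' code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # scaled similarities
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # block disallowed positions
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 6, 64))  # 6 positions, d_k = d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 64)
```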
Multi-head attention
The Transformer treats one scaled dot-product attention as a single head and uses multi-head attention (Multi-Head Attention), which runs several heads in parallel. The number of heads $h$ and the dimensionality of each head ($d_k$, $d_v$) are a trade-off, so the total number of parameters is the same regardless of the number of heads.
Instead of computing a single dot-product attention over the $d_{model}$-dimensional $Q$, $K$, and $V$, each of $Q$, $K$, and $V$ is linearly projected with $h$ different weight matrices down to $d_k$, $d_k$, and $d_v$ dimensions, and $h$ dot-product attentions are computed. The $d_v$-dimensional outputs of the individual attentions are concatenated and linearly projected back to $d_{model}$ dimensions with a weight matrix $W^O$.
Multi-head attention is expressed by the following equations:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V)$$
Here $d_{model}$ is the output dimensionality of every layer, $d_k$ is the dimensionality of $Q$ and $K$, and $d_v$ is the dimensionality of $V$.
Experimentally, multi-head attention turns out to perform better than single-head attention. Multi-head attention can be interpreted as each head processing a different subspace at a different position, whereas in a single head the averaging gets in the way of this.
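A minimal NumPy sketch of multi-head attention with the base model's sizes ($d_{model} = 512$, $h = 8$, $d_k = d_v = 64$); the projection matrices here are random placeholders for the learned weights:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: Q (n, d_k), K (n, d_k), V (n, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

d_model, h = 512, 8
d_k = d_v = d_model // h  # 64 dimensions per head, as in the base model

rng = np.random.default_rng(0)
W_Q = rng.normal(size=(h, d_model, d_k)) * 0.02  # per-head query projections
W_K = rng.normal(size=(h, d_model, d_k)) * 0.02
W_V = rng.normal(size=(h, d_model, d_v)) * 0.02
W_O = rng.normal(size=(h * d_v, d_model)) * 0.02  # final output projection

def multi_head_attention(Q, K, V):
    """Q, K, V: (n, d_model) -> (n, d_model)."""
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O  # concat heads, project back

X = rng.normal(size=(10, d_model))          # 10 positions fed as self-attention
print(multi_head_attention(X, X, X).shape)  # (10, 512)
```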
Position-wise feed-forward network
The position-wise feed-forward network (Position-wise Feed-Forward Network, FFN) is, as the name says, an FFN that processes each position of the word sequence independently.
The FFN is expressed by the following equation:
$$\mathrm{FFN}(x) = \max(0,\ x W_1 + b_1)\, W_2 + b_2$$
It is a two-layer fully connected neural network consisting of a $d_{ff} = 2048$-dimensional hidden layer with ReLU activation and a $d_{model} = 512$-dimensional output layer.
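A minimal sketch with the base model's sizes (weights are random placeholders); because the input is multiplied row by row, the same two-layer network is applied to every position independently:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def position_wise_ffn(x):
    """x: (seq_len, d_model) -> (seq_len, d_model), applied row by row."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU between two linear layers

x = rng.normal(size=(10, d_model))
print(position_wise_ffn(x).shape)  # (10, 512)
```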
Positional encoding
Since the Transformer uses no RNNs or CNNs, information about word order (the relative or absolute positions of the words) has to be added.
In this method, a positional encoding (Positional Encoding) matrix $PE$ is added element-wise to the input embedding matrix.
Each component of the positional encoding matrix $PE$ is given by the following equations:
$$PE_{(pos,\ 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\ 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Here $pos$ is the position of the word and $i$ indexes the component dimension. Each dimension of the positional encoding corresponds to a sinusoid whose wavelength grows geometrically from $2\pi$ to $10000 \cdot 2\pi$.
Visualizing the positional encoding gives the following figure (source).
The horizontal axis is the word position (0 to 99), the vertical axis is the component dimension (0 to 511), and the shading is the value being added (-1 to 1).
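A minimal sketch that builds the positional-encoding matrix from the formula above, using the same sizes as the figure (100 positions x 512 dimensions):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]       # (max_len, 1) word positions
    i = np.arange(0, d_model, 2)[None, :]   # even component dimensions
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)             # even dims: sine
    pe[:, 1::2] = np.cos(angle)             # odd dims: cosine
    return pe

pe = positional_encoding(100, 512)
print(pe.shape, pe.min(), pe.max())  # (100, 512), values in [-1, 1]
```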
The first appearance of positional encoding is in the End-To-End Memory Network (MemN2N). Question answering deals with multiple input sentences, and a temporal encoding (Temporal Encoding) is used to encode the temporal order of the input sentences.
Experiments
The implementation is Tensor2Tensor.
The datasets are WMT'14 English-French (36M sentence pairs) and English-German (4.5M sentence pairs). To handle rare words, a Wordpiece (subword) vocabulary of 32000 tokens shared between the source and target languages is used.
Training was done on 8 P100 GPUs. The base model described below takes about 0.4 seconds per training step, or 12 hours for all 100k steps. The big model takes 1 second per training step, or 3.5 days for all 300k steps.
The optimizer is Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$), and the learning rate is varied over the course of training:
$$lrate = d_{model}^{-0.5} \cdot \min\!\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)$$
With this formula the learning rate increases linearly for the first $warmup\_steps$ steps and then decreases in proportion to the inverse square root of the step number. $warmup\_steps$ is 4000; at its peak the learning rate is roughly $7 \times 10^{-4}$.
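A minimal sketch of this schedule with $d_{model} = 512$ and $warmup\_steps = 4000$:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in [1, 1000, 4000, 10000, 100000]:
    print(step, round(transformer_lr(step), 6))
# rises linearly, peaks around step 4000 at roughly 7e-4, then decays as 1/sqrt(step)
```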
Three kinds of regularization are used during training.
- Label smoothing (Label Smoothing): label smoothing with $\epsilon_{ls} = 0.1$ [Szegedy, 2015] is applied. Because the model learns to be unsure of its labels, perplexity gets worse, but accuracy and BLEU improve. (A small sketch follows below.)
- Residual dropout (Residual Dropout): dropout is applied to the output of each sub-layer before the residual addition and layer normalization (Add & Norm). Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and the decoder. The dropout rate is $P_{drop} = 0.1$.
- Attention dropout (Attention Dropout): the softmax activations in dot-product attention can be seen as an analogue of the hidden-layer activations of a feed-forward network, so dropout is applied to the softmax output (the attention weights).
- Rethinking the Inception Architecture for Computer Vision [Christian Szegedy, arXiv, 2015/12]
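As a reference for the label smoothing above, here is a minimal sketch of one common formulation (spreading $\epsilon_{ls}$ uniformly over the vocabulary); it is an illustration, not the paper's exact loss code:

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """(1 - eps) on the true class, eps / vocab_size spread over all classes."""
    one_hot = np.eye(vocab_size)[target_ids]
    return (1.0 - eps) * one_hot + eps / vocab_size

targets = smooth_labels(np.array([2, 0]), vocab_size=5)
print(np.round(targets, 3))  # each row still sums to 1
```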
The dictionary matrix of the embedding layer and the matrix of the softmax layer share weights [Press, 2016].
- Using the Output Embedding to Improve Language Models [Ofir Press, arXiv, 2016/08]
At evaluation time, beam search is used (beam width 4, length penalty $\alpha = 0.6$).
Results
The experimental results are shown in the figure below.
On WMT'14 English-German, the big model achieves a BLEU score of 28.4 (2.0 higher than the previous SOTA). Even the base model outperforms the previous SOTA, including ensembles, at a far smaller training cost.
On WMT'14 English-French, the big model achieves a BLEU score of 41.0, outperforming the previous single-model SOTA despite using 1/4 of its training cost.
To measure the importance of each component, the base model was evaluated with various modifications, and the big model was evaluated as well.
Varying the number of attention heads and the dimensions of the keys and values while keeping the amount of computation constant, a single head is 0.9 BLEU worse than the best setting; having too many heads also hurts performance.
Visualizing the self-attention
Visualizing the multi-head self-attention (the encoder's 5th block) shows that the individual heads handle syntactic and semantic structure.
When the query is "making", the self-attention of the 8 heads looks like the figure below.
The top row is the query side and the bottom row is the key side. The shading of the 8 colored markers shows the magnitude of the attention weights of the 8 heads, and the color of each line indicates the head with the largest attention weight.
In this example, many heads capture the long-distance dependency of the verb "making" and form the phrase "making...more difficult".
When the query is "its", the self-attention of two of the heads looks like the figure below.
The purple shading is the attention weight of the first head and the brown shading is that of the second head.
In this example, the pronoun "its" forms an anaphoric relation pointing to the noun "Law", and the attention weights are sharply concentrated on it.
Self-attention
The origin of self-attention is a method proposed by Yoshua Bengio's research group for computing sentence embedding vectors.
- A Structured Self-attentive Sentence Embedding [Zhouhan Lin, last: Yoshua Bengio, arXiv, 2017/03]
The result of embedding sentences for pledge classification is shown below. The shading of the red markers is the magnitude of the attention weights (Attention Weight).
Let the embedding vector of each word be $h_i$ and the input to the self-attention be $H = (h_1, \ldots, h_n)$.
The attention weights $A$ are computed from the input $H$ by the following equation:
$$A = \mathrm{softmax}\!\left(W_{s2}\, \tanh(W_{s1} H^\top)\right)$$
Here $W_{s1}$ and $W_{s2}$ are weight matrices.
The sentence embedding $M$ is obtained as the dot product of the attention weights $A$, obtained from the input $H$, and the input $H$ itself:
$$M = A H$$
In other words, self-attention is a technique that copies the input $H$ into two branches, applies some function to one of the branches to obtain the attention weights $A$, and then merges the branches (i.e. takes their dot product).
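A minimal NumPy sketch of this sentence-embedding style of self-attention (the sizes are arbitrary and `W_s1`, `W_s2` are random placeholders for the learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d, d_a, r = 6, 32, 16, 3   # sentence length, word dim, attention dim, number of hops
rng = np.random.default_rng(0)
H = rng.normal(size=(n, d))              # word embeddings fed to the self-attention
W_s1 = rng.normal(size=(d_a, d)) * 0.1   # learned weights (random placeholders here)
W_s2 = rng.normal(size=(r, d_a)) * 0.1

A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=-1)  # (r, n) attention weights over words
M = A @ H                                         # (r, d) sentence embedding matrix
print(A.shape, M.shape)
```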
Let's compare a fully connected layer with self-attention.
| | Forward pass | Backward pass |
|---|---|---|
| Fully connected | $Y = WH$ | |
| Self-attention | $M = AH$ (with $A = f(H)$) | |
A fully connected layer is the dot product of the "weight" and the "input"; self-attention is the dot product of the "attention weight" and the "input".
Self-attention and other networks
Let's compare the self-attention layer with other networks (recurrent layers and convolutional layers).
- Complexity per Layer: total computational complexity per layer (lower complexity scales up more easily)
- Sequential Operations: the minimum number of sequential operations needed to process the sequence (fewer operations parallelize more easily)
- Maximum Path Length: the maximum length of the path connecting any input position to any output position (shorter paths make long-range dependencies easier to learn)
Here $n$ is the length of the word sequence, $d$ is the dimensionality of each word, $k$ is the kernel size of the convolution, and $r$ is the neighborhood size of restricted self-attention. For each metric, a smaller order is better.
The attention weights of a self-attention layer are quadratic in the sequence length $n$, so its complexity is $O(n^2 \cdot d)$. The weight matrix of a recurrent layer, on the other hand, is quadratic in the dimensionality $d$, so its complexity is $O(n \cdot d^2)$.
Since usually $n < d$, the self-attention layer is faster than the recurrent layer.
Both the self-attention layer and the convolutional layer can process the sequence (connect all positions) in a constant number of sequential operations, whereas the recurrent layer needs $O(n)$ time-step operations.
A single convolutional layer does not connect all pairs of input and output positions, since $k < n$. Connecting all pairs requires a stack of $O(n/k)$ layers with ordinary convolutions, or $O(\log_k n)$ layers with dilated convolutions (Dilated Convolution).
Restricted self-attention only attends to the input words in a neighborhood of size $r$ centered on the output position, so the complexity shrinks at the cost of the maximum path length growing to $O(n/r)$.
Self-attention ≈ the "Squeeze-and-Excitation" of the ILSVRC 2017 winner
Self-attention has also been shown to perform well in image recognition.
- Squeeze-and-Excitation Networks [Jie Hu, ILSVRC 2017 Winner, arXiv, 2017/09], authors' slides
- Residual Attention Network for Image Classification [Fei Wang, arXiv, 2017/04]
Self-attention can be interpreted as a technique that copies an input coming from a single place into two branches, applies some function to one (or both) of them to obtain the "attention weights", and then merges the branches (i.e. dot product or Hadamard product).
The Squeeze-and-Excitation Network and the Residual Attention Network, which both perform well on ImageNet, share this self-attention structure.
(Figure: Squeeze-and-Excitation Network (left) and Residual Attention Network (right))
The Squeeze-and-Excitation Network (SENet) computes attention weights for "which channels to emphasize" and applies them as a Hadamard product along the channel dimension. The attention weights are obtained from a "Squeeze" step, which embeds global information into a vector with as many dimensions as there are channels, and an "Excitation" step, which recalibrates that vector.
The Residual Attention Network (RAN) computes attention weights for "which parts of the feature map to focus on" and applies them as a Hadamard product. The attention weights are obtained from a "bottom-up" step, which shrinks the feature map to extract context, and a "top-down" step, which derives per-pixel strengths from that context.
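As a reference, here is a minimal NumPy sketch of the squeeze-and-excitation idea for a single feature map (the channel count, reduction ratio, and FC weights are illustrative placeholders, not the paper's configuration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

C, H, W, reduction = 64, 8, 8, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(C, H, W))                    # one feature map
W1 = rng.normal(size=(C, C // reduction)) * 0.1   # Excitation: bottleneck FC layers
W2 = rng.normal(size=(C // reduction, C)) * 0.1

z = x.mean(axis=(1, 2))                    # Squeeze: global average pool -> (C,)
s = sigmoid(np.maximum(0, z @ W1) @ W2)    # Excitation: per-channel weights in (0, 1)
y = x * s[:, None, None]                   # rescale channels (Hadamard product)
print(y.shape)  # (64, 8, 8)
```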
Let's compare the SOTA error rates on CIFAR and ImageNet.
Method | CIFAR-10 | CIFAR-100 | ImageNet Top-1 | ImageNet Top-5 |
---|---|---|---|---|
RAN (Attention-92) | 4.99 | 21.71 | 19.5 | 4.8 |
RAN (Attention-452) | 3.90 | 20.45 | - | - |
SENet | - | - | 17.28 | 3.79 |
Shake-Shake | 2.86 | 15.85 | - | - |
ShakeDrop | 2.31 | 12.19 | - | - |
- Shake-Shake Regularization [Xavier Gastaldi, arXiv, 2017/05]
- ShakeDrop Regularization [Yoshihiro Yamada, OpenReview, 2017/10]
The two kinds of models cannot be compared directly, but stochastic mixing and dropping of convolutional branches plus data augmentation seem to be the stronger ingredients (source).
A Transformer without autoregression
- Non-Autoregressive Neural Machine Translation [Jiatao Gu, last: Richard Socher, ICLR 2018]
This is a Transformer that translates quickly without using autoregression at either training or inference time. The model stacks blocks of [self-attention, (positionally encoded) self-attention, source-target attention, position-wise FFN].
From the encoder's output it predicts, for each word of the encoder input, how many times that word should be copied into the decoder input (i.e. fertility prediction). The decoder can therefore generate a predetermined number of words without autoregression.
DeepL Translator
Is the right answer in deep learning simply to club the problem with data volume and machine power?
DeepL published a stunning press release. The BLEU scores are so high that the Transformer pales in comparison.
Since it outperformed Google Translate, it reportedly caused quite a stir inside Google when it appeared.
- Press Information: DeepL Translator Launch [Gereon Frahling, DeepL, 2017]
The developer of DeepL Translator operates the bilingual sentence search engine Linguee (2009-). The model is trained on 1 billion sentence pairs crawled from Linguee. Compared with WMT'14 English-French (36M sentence pairs) and English-German (4.5M sentence pairs), which are already considered large corpora, this is an enormous amount of training data.
They also own a 5.1 petaFLOPS supercomputer (equivalent to about 23rd place in the world ranking), which packs more firepower than PFN's 1024 GPUs. Impressive.
The paper is unpublished, but presumably the model is close to a huge Transformer.
Unfortunately, DeepL Translator does not offer Japanese-English translation. However, Linguee owns a large number of Japanese-English sentence pairs, so support may come in the future.
Also, the Japan Patent Office and NICT have jointly released a Japanese-English parallel dataset of 350 million sentence pairs (the JPO corpus). If you like slugging it out with large-scale data, it might be worth applying for access.