Natural Language Processing with Deep Learning: The Theory of RNNs, LSTMs, and Neural Machine Translation
This article gives a mathematical exposition of a series of techniques: neural networks, backpropagation, language models, RNNs, LSTMs, and neural machine translation.
Table of Contents (Part 1)
Neural Networks
Recurrent Neural Networks (RNN)
- Recurrent Neural Network Language Model (RNNLM)
- Backpropagation Through Time (BPTT)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Unit (GRU)
- Dropout and Batch Normalization for RNNs
Neural Machine Translation (NMT)
Evaluation Methods
The Latest in Neural Machine Translation, Tracked on arXiv
Neural Networks
If you are not familiar with forward and backward propagation in neural networks at all, I recommend first reading a primer note on backpropagation or an article on how backpropagation works; an introduction to the mathematics of machine learning and deep learning likewise covers neural networks from the basics.
Forward Propagation
As a warm-up, let us trace forward propagation through a small neural network with an input layer (blue), a hidden layer (purple), and an output layer (red). Let the weight matrix and bias vector between the input and hidden layers, and those between the hidden and output layers, be given; the activations are the weighted sums, and every activation function is taken to be the sigmoid function.
The network and the graph in the figure below are equivalent (bias vectors omitted).
[Figure: Network | Graph]
The hidden layer is given by the following equation:
The output layer is given by the following equation:
The same procedure applies even when the network becomes deeper.
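To make the procedure concrete, here is a minimal NumPy sketch of forward propagation through a single hidden layer. The layer sizes and the names W, b, V, c are illustrative assumptions, not values taken from the figure.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes: 3 input units, 4 hidden units, 2 output units.
rng = np.random.RandomState(0)
x = rng.randn(3)                     # input vector
W, b = rng.randn(4, 3), np.zeros(4)  # input-to-hidden weights and bias
V, c = rng.randn(2, 4), np.zeros(2)  # hidden-to-output weights and bias

h = sigmoid(W.dot(x) + b)            # hidden activation (weighted sum, then sigmoid)
y = sigmoid(V.dot(h) + c)            # output activation
print(y)
```

Stacking further layers simply repeats the same multiply-and-activate step.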
Backpropagation
Backpropagation (back-prop) means differentiating the loss function with respect to each parameter to obtain the gradient (Grad) attached to each parameter (Data), and then updating the parameters in the direction that decreases the loss. The gradient can thus be regarded as a variable that accompanies each parameter. Inside Chainer, too, a Variable instance holds both Data, which stores the parameters (weight matrices and bias vectors), and Grad, which stores each parameter's gradient; the forward and backward methods are applied to these to update the Variable.
The chain rule in back-prop is the differentiation of a multivariable function. For a single-variable function, the ordinary rule for differentiating a composite function suffices, but for a multivariable function it becomes the sum of the derivatives of composite functions, in other words the total of the gradients obtained along every path through the network. Since this is hard to explain in words, let us compute the gradient attached to a component of the weight matrix between each pair of layers and trace how the gradient changes with the depth of the layer.
The loss function is the squared error (with the teacher signal as the target):
Its partial derivative is:
The activation function is the sigmoid:
and its derivative is:
Find the gradient of the weight closest to the output: back-prop follows a single path.
Find the gradient one layer deeper: back-prop follows the paths through each unit of the layer above, so the gradient is the sum of the partial derivatives along those paths.
Find the gradient another layer deeper: the paths branch once more, and the gradient is again the sum of the partial derivatives along all of them.
Find the gradient yet another layer deeper: the paths branch yet again, and the gradient is the sum of the partial derivatives along every path.
Intuitively, each additional layer adds one more "weight," one more "derivative of the activation function," and one more "sigma (a sum over paths)." In Backpropagation Through Time for recurrent neural nets, described later, the error at each layer is expressed by the recurrence shown below; viewed as a recurrence, each additional layer we trace back adds one more "weight" and one more "derivative of the activation function" (there is no sigma because the notation is vectorized).
Recurrent Neural Networks (RNN)
A Recurrent Neural Network (RNN) is a neural network with a recurrent structure, used mainly for sequential data such as speech and natural language. For a mathematical treatment, an article that writes the recurrent neural network out in equations is a good start; if you would rather not face the equations right away, I suggest first reading an explanation of the Recurrent Neural Network Language Model, a clear account of RNNLM, which applies the RNN to a neural language model.
A neural language model computes the probability of a sentence by the following equation:
We would like the conditional probability that the next word occurs given the previously input words, and there are several ways to estimate it based on maximum likelihood. The 1-gram (unigram) model ignores the history (the past words) and simply counts word occurrences as in the following equation, where the denominator is the total number of words in the corpus:
The n-gram language model treated in this article consults the history as in the following equation and thus takes context into account:
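As a minimal sketch of the counting estimates above, the following approximates the conditional probability of the next word with bigram (2-gram) counts; the toy corpus and the sentence-boundary handling are simplifying assumptions.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the cat ate the fish"]
tokens = [w for s in corpus for w in s.split()]

unigram = Counter(tokens)
# For simplicity the bigram count ignores sentence boundaries.
bigram = Counter(zip(tokens, tokens[1:]))

def p_next(word, history):
    # P(word | history) estimated by maximum likelihood from counts
    return bigram[(history, word)] / unigram[history]

print(p_next("cat", "the"))  # count("the cat") / count("the") = 0.5
```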
Recurrent Neural Network Language Model (RNNLM)
Consider the task of predicting the next word from the words input so far. A one-layer RNN is described by the input, hidden, and output layers at the current step together with the hidden layer from the previous step: by feeding the previous step's hidden layer into the next step's input, the network takes past information into account.
The model is shown in the figure below (excerpted from the RNNLM author's slides).
Forward propagation from the input layer to the hidden layer is given by the following equation (bias vectors omitted):
Expanding this,
where the quantities are as defined above, and the derivative is as follows.
- 㯠one-hot ( æåã ã , ä»ã®æåã¯å ¨ã¦ ) 㪠次å åèªãã¯ãã«ï¼ (åèªæ°) 㯠~
- ã¯éã¿è¡å (åèªã®è¾æ¸)ï¼è¡æ° = Word Embeddings (åèªã®ç¹å¾´ã表ç¾ãããã¯ãã«) ã®æ¬¡å æ° ï¼åæ° = è¾æ¸ã®åèªæ° ï¼
- ã¯è©¦è¡ ã«ããã 次å é ã層ãã¯ãã«
- ã¯éã¿è¡åï¼è¡æ° = åæ° = é ã層ãã¯ãã«ã®æ¬¡å æ° ï¼
- ã¯ã·ã°ã¢ã¤ãé¢æ° (åæåã ~ ã®å¤ã«éç·å½¢å¤æ)ï¼æ®é㯠ã使ããè¨ç®ãç°¡ç¥å
- ã¯è©¦è¡ ã«ããã 次å é ã層ãã¯ãã«ï¼h 㯠~
Forward propagation from the hidden layer to the output layer is given by the following equation (bias vectors omitted); a NumPy sketch of one full step follows the bullet list below:
Expanding this,
where the quantities are defined as above;
the derivative is as follows.
- The output weight matrix has one row per dictionary word; the dot product of each row with the hidden vector gives that word's score (rows = vocabulary size, columns = hidden-layer size).
- The softmax function maps each component nonlinearly into the range 0 to 1 so that the components sum to 1.
- The output is a probability distribution over the vocabulary, predicting the probability of each word occurring next.
For back-prop, first consider using the squared error function; the index runs over the dataset and the target is the teacher label:
Its partial derivative follows directly.
The model parameters we want to learn are the weight matrices, so we differentiate the loss with respect to each of them, obtain the gradients, and update the parameters in the direction that minimizes the squared error,
where the auxiliary quantities are defined by the equations above.
㨠㯠Back-prop ã«ããã¦èª¤å·®ã¨å¼ã°ããï¼ããã¯æ¬¡å¼ã«ãã£ã¦æ±ãããã¨ãã§ããï¼
ãã ãï¼ ã¯ã¢ããã¼ã«ç© (æåãã¨ã®ç©)ï¼
以ä¸ããï¼æ¬¡å¼ã«ãã£ã¦ãã©ã¡ã¼ã¿ãæ´æ°ããï¼ ã¯å¦ç¿çï¼
One might expect this to work nicely, but it does not. The squared error makes the error computation cumbersome, so the cross-entropy error function, familiar from classification, is used instead; the index runs over the dataset and over the units of the output layer:
Using this as the loss function, the error takes a simple form. Because the softmax in the intermediate step introduces dependencies among the components of the output layer, we work with individual components rather than whole vectors.
ãã㧠ã¯ã½ããããã¯ã¹é¢æ°ã§ï¼
å¾®åã¯
㨠ã§å ´ååããã¦è¨ç®ããã¨ï¼
ããã§æ師信å·ã¯ï¼æ£è§£ã® 㧠ï¼ä¸æ£è§£ã® 㧠ãªã®ã§
ï¼ãã£ã¦ï¼
Backpropagation Through Time (BPTT)
The equations so far reflect only the error of the immediately preceding step in back-prop. BPTT is used to propagate errors further back into the past.
The figure below uses truncated BPTT, which keeps a fixed number of past states and adds in their errors.
First, going back just one step (expressing the error at the previous step in terms of the current one),
Generalizing, this becomes the following recursive recurrence, where the truncation length determines how far back in time we go; a small fixed value is typically used.
The parameters are then updated by the following equations:
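As a concrete sketch of how truncated BPTT is often organized in Chainer (the library used later in this article), the loop below accumulates the loss for a fixed number of steps, back-propagates, cuts the computational graph with unchain_backward so that errors are not propagated further into the past, and then updates. The model.step method, the sequence variable, and the truncation length are illustrative assumptions.

```python
import chainer.optimizers as O

optimizer = O.SGD(lr=0.5)
optimizer.setup(model)                           # model: a Chain that keeps its hidden state

bptt_len, accum_loss = 35, 0
for t, (wid, next_wid) in enumerate(zip(sequence, sequence[1:])):
    accum_loss += model.step(wid, next_wid)      # forward for one time step (assumed method)
    if (t + 1) % bptt_len == 0:                  # every bptt_len steps...
        model.zerograds()
        accum_loss.backward()                    # back-prop through the last bptt_len steps
        accum_loss.unchain_backward()            # cut the graph: do not go further into the past
        optimizer.update()
        accum_loss = 0
```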
As an aside, besides BPTT there are many other back-prop schemes for RNNs, such as Real Time Recurrent Learning (RTRL).
Also, for the Recursive NN (likewise abbreviated RNN), which is easily confused with the Recurrent NN, a survey of the state of deep learning in natural language processing is a good reference.
Recent work
- Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations (arXiv, 2016/05): memorizes long-term dependencies without gated blocks like the LSTM
- Memory-Efficient Backpropagation Through Time (arXiv, 2016/06): memory-efficient BPTT over the LSTM's memory and hidden states
- An Actor-Critic Algorithm for Sequence Prediction (arXiv, 2016/07): from Bengio's group; predicts the next word, language-model style, with reinforcement-learning Actor-Critic
- Tuning Recurrent Neural Networks with Reinforcement Learning (project page, 2016/11): Google's Magenta group; tuning an RNN with DQN
Long Short-Term Memory (LSTM)
The LSTM is a variant in which the RNN's hidden layer is replaced by an LSTM block. Because it can learn the long-term dependencies in sequential data that plain RNNs struggle with, it is widely used in NLP. An LSTM block consists of a memory and three gates (input gate, output gate, forget gate). The memory accumulates past inputs in a vector known as the Constant Error Carousel; since each gate is a function, the LSTM block itself can be seen as a composition of functions. The following references are helpful here:
- Figures quoted from: Understanding LSTM Networks
- On the usefulness of the LSTM against vanishing and exploding gradients: わかるLSTM ～ 最近の動向と共に
- beam2d's explanation of RNNs, BPTT, and LSTMs: Recurrent Neural Networks
LSTM block
Let the input from the input layer, the output of the LSTM block, and the LSTM block's output at the previous step be given. The input-side and recurrent weight matrices, the bias vectors, the tanh function, and the sigmoid function appear in the equations below.
[Figure: Input (right) and input gate (left) | Forget gate]
The input to the LSTM block (the same as in a plain RNN) is given by:
The input gate transformation is given by:
The forget gate transformation is given by:
[Figure: Memory cell | Output gate (left) and output (right)]
The memory cell transformation is given by the following equations, where the element-wise (Hadamard) product is used. Incidentally, this memory is what helps avoid the vanishing gradient problem. Vanishing gradients arise when sigmoid functions are stacked over many layers (the derivative of the sigmoid is at most 1/4, so the gradient shrinks toward zero as it is multiplied again and again), whereas the memory, which involves no nonlinear transformation, tends to preserve the gradient. Other remedies that also keep gradients alive deep into a network are the ReLU (Rectified Linear Unit), which is the identity map for positive inputs, and the Residual Block, which adds the input directly to the output.
The output gate transformation is given by:
The output of the LSTM block is given by:
The composition of all the functions above constitutes the LSTM block; the cell and the output are carried over to the LSTM block at the next step.
Checking the dimensionality of each parameter: if the input is a word-embedding vector and the block's output (and the memory cell) are hidden-size vectors, then the input-side matrices are hidden-size by embedding-size, the recurrent matrices are hidden-size by hidden-size, and the bias terms are hidden-size vectors.
[Figure: Peephole connections | GRU block]
Many variants of the LSTM have been proposed to date.
Peephole connections are a mechanism that lets the internal state of the memory cell directly influence the three gates: the cell state is multiplied element-wise by peephole weights and added into the input and forget gates (using the previous cell state) and into the output gate (using the current one).
The peephole weights are vectors of the same size as the hidden state.
Gated Recurrent Unit (GRU)
The GRU, often compared with the LSTM, simplifies the LSTM block into a GRU block through a few changes. First, the input gate and forget gate are combined into an update gate. The cell and the output are merged, and a reset gate is introduced to reinitialize the merged state.
The update gate transformation is given by:
The reset gate transformation is given by:
The hidden state transformation is given by the following equations, using the Hadamard (element-wise) product:
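For comparison with the LSTM equations, here is a minimal NumPy sketch of one GRU step; the sizes, the weight names, and the gate convention (which side the update gate multiplies) are illustrative assumptions, as the convention varies between papers.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

hidden, inputs = 4, 3                      # toy sizes (assumptions)
rng = np.random.RandomState(0)
Wz, Uz = rng.randn(hidden, inputs), rng.randn(hidden, hidden)  # update gate
Wr, Ur = rng.randn(hidden, inputs), rng.randn(hidden, hidden)  # reset gate
Wh, Uh = rng.randn(hidden, inputs), rng.randn(hidden, hidden)  # candidate state

def gru_step(x, h_prev):
    z = sigmoid(Wz.dot(x) + Uz.dot(h_prev))            # update gate
    r = sigmoid(Wr.dot(x) + Ur.dot(h_prev))            # reset gate
    h_tilde = np.tanh(Wh.dot(x) + Uh.dot(r * h_prev))  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde              # interpolate old and candidate states

h = gru_step(rng.randn(inputs), np.zeros(hidden))
print(h.shape)
```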
LSTMs, GRUs, and the seq2seq model described later are all neural nets with internal memory: they hold and process information at the same time (e.g., the memory cell). In contrast, the Neural Turing Machine (arXiv, 2014/10) and Memory Networks (arXiv, 2015/3) are neural networks with external memory, which separate storage from processing and thereby enable RAM-like long-term memory. Attention, which keeps the encoder's intermediate states, can be seen as sitting in between.
Chainer Implementation
As an example, the LSTM forward computation for a language model (predicting the next word) is written out explicitly below. Of course, Chainer's own links.LSTM or functions.LSTM would express it in a few lines. The code is reused later to compute perplexity.
```python
import numpy as np
from chainer import Chain, Variable
import chainer.functions as F
import chainer.links as L

class ChainerLSTM(Chain):
    def __init__(self):
        super(ChainerLSTM, self).__init__(
            U  = L.EmbedID(VocabSize, hiddenSize),
            Wa = L.Linear(hiddenSize, hiddenSize),
            # omitted: Wi, Wf, Wo, Ra, Ri, Rf, Ro are defined the same way as Wa
            V  = L.Linear(hiddenSize, VocabSize),
        )

    def reset(self):
        self.zerograds()
        self.cell   = Variable(np.zeros((1, hiddenSize), dtype=np.float32))
        self.hidden = Variable(np.zeros((1, hiddenSize), dtype=np.float32))

def forward(model, sentence):
    model.reset()
    loss = Variable(np.zeros((), dtype=np.float32))
    for i in range(len(sentence)):
        wid = sentence[i]
        embed = model.U(Variable(np.array([wid], dtype=np.int32)))
        a = F.tanh(model.Wa(embed) + model.Ra(model.hidden))
        inputGate  = F.sigmoid(model.Wi(embed) + model.Ri(model.hidden))
        forgetGate = F.sigmoid(model.Wf(embed) + model.Rf(model.hidden))
        model.cell = inputGate * a + forgetGate * model.cell
        outputGate = F.sigmoid(model.Wo(embed) + model.Ro(model.hidden))
        model.hidden = outputGate * F.tanh(model.cell)
        y = model.V(model.hidden)
        nextwid = sentence[i + 1] if (i != len(sentence) - 1) else eosID
        target = Variable(np.array([nextwid], dtype=np.int32))
        loss += F.softmax_cross_entropy(y, target)
    return loss
```
Recent work
The Quasi-Recurrent Neural Network (QRNN) (arXiv, 2016/11) proposes a method that matches or exceeds existing LSTMs while training up to 16x faster. Research using convolutions for NLP has been increasing lately; see also New neural network building block allows faster and more accurate text understanding.
Neural Architecture Search with Reinforcement Learning (OpenReview, 2016/11) is almost insane in scale. It searches for optimal network architectures with reinforcement learning, not limited to RNNs; on image-recognition error rates it is competitive with recent SOTA models such as DenseNet and Wide ResNet, and on perplexity it surpasses the SOTA LSTM. It keeps 800 networks training on 800 GPUs. Insane indeed. In the figure below it generates something resembling an LSTM block.
Dropout and Batch Normalization for RNNs
This section introduces a few techniques that improve RNN training.
Dropout
Dropout randomly ignores hidden units with a probability drawn from a Bernoulli distribution (for example 50%), playing a role similar to ensemble learning. Dropout acts as a regularizer: it is said to curb overfitting to the training data and to improve generalization to validation (test) data. At test time the parameters are scaled by the probability used during dropout.
A naive application of dropout to the RNN family cannot be placed on the recurrent connections along the time axis, because it would disrupt the behavior of the memory cell; as in the figure below, it is applied to the non-recurrent connections instead.
- Recurrent Neural Network Regularization (arXiv, 2014/9)
Applying dropout to the linear transformations of the input and the three gates is expressed as follows:
The variational RNN instead uses the same dropout mask at every time step. In the figure below, arrows of different colors correspond to different dropout masks; in the variational RNN, layers that share parameters receive the same mask. The improvement in perplexity from the variational RNN is modest, so dropout on the non-recurrent connections alone may well be sufficient.
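The following NumPy sketch contrasts the two masking schemes: a fresh mask per time step on the non-recurrent connection versus a single mask reused at every step as in the variational RNN. The sizes and the keep probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)
keep_prob, hidden, steps = 0.5, 6, 4        # toy sizes and keep probability (assumptions)

# Naive RNN dropout: a fresh Bernoulli mask for the non-recurrent (input) connection
# at every time step; the recurrent connection h -> h is left untouched.
naive_masks = rng.rand(steps, hidden) < keep_prob

# Variational RNN dropout: one mask sampled per sequence and reused at every step,
# so parameter-sharing (recurrent) connections see a consistent mask.
variational_mask = rng.rand(hidden) < keep_prob

for t in range(steps):
    x_t = rng.randn(hidden)
    x_naive = x_t * naive_masks[t]           # a different mask each step
    x_variational = x_t * variational_mask   # the same mask at every step
# (at test time, following the text, weights are scaled by keep_prob instead of masking)
```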
Batch Normalization
The input distribution of each layer shifts as the parameters of the layers below it are updated. Since the gradient of each layer is estimated as an average over a mini-batch, this shifting distribution puts a different bias on every mini-batch and makes training unstable. The problem is called internal covariate shift.
Normalization, for reference, is the preprocessing step of standardizing each training sample to a common mean and variance; normalized images, for example, vary less in brightness. In the same spirit, batch normalization standardizes the input of each layer, per unit and per mini-batch, to zero mean and unit variance, which suppresses internal covariate shift. Batch normalization also acts as a regularizer, so it can substitute for dropout or L1/L2 regularization.
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (arXiv, 2015/2)
The transformation of an activation within a mini-batch is expressed by the following equations, with learned scale and shift parameters that restore the network's expressive power:
The upper-left figure compares batch normalization: Inception is the model without it, BN the model with it, and the number after BN- is the multiplier applied to the baseline learning rate. With batch normalization, training stays stable even at larger learning rates, so training becomes faster.
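Here is a minimal NumPy sketch of the batch-normalization transform itself: normalize each unit over the mini-batch, then apply the learned scale and shift. The shapes and values are illustrative assumptions.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, units). Normalize each unit over the mini-batch,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.RandomState(0)
x = rng.randn(32, 8) * 3.0 + 5.0          # a mini-batch with shifted statistics
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per unit
```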
[Figure: Batch Normalization | Recurrent Batch Normalization]
Recurrent Batch Normalization applies this batch normalization inside the LSTM. The paper experiments on Sequential MNIST (pixel-by-pixel MNIST), a task in which an MNIST image is fed one pixel per step and the label must be predicted. As the upper-right figure shows, the batch-normalized LSTM converges faster.
- Recurrent Batch Normalization (arXiv, 2016/3)
The batch-normalized LSTM applies the normalization in two places: input-to-hidden and hidden-to-hidden. For brevity, define the batch-normalization operator as follows:
Recurrent batch normalization is then written as the following equations, with the linear transformations of the input and the three gates collapsed into a single combined transformation:
As an aside, since batch normalization a variety of other normalization schemes have been proposed: Weight Normalization, which normalizes the weights; Layer Normalization, which normalizes across the units of each layer; and Batch Renormalization, which extends batch normalization with a scale and shift computed from running statistics.
- Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks (arXiv, 2016/2)
- Layer Normalization (arXiv, 2016/7)
- Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models (arXiv, 2017/2)
Neural Machine Translation (NMT)
Statistical machine translation (SMT), the mainstream approach to machine translation, refers to systems that learn a probabilistic model maximizing the likelihood of the target translation given the source sentence and use it to translate into the target language.
Neural machine translation (NMT), which drew attention with the upgrade to Google Translate, uses a neural network as the probabilistic model learned in statistical machine translation. In particular, the encoder-decoder translation model can translate with neural networks alone and has attracted a great deal of attention.
Sequence to Sequence (seq2seq)
seq2seq (synonymous with the encoder-decoder model) combines two LSTM blocks: an encoder that processes the input and a decoder that generates the output (the earlier paper uses GRU blocks). In general the encoder and decoder LSTM blocks have separate parameters, but sharing the parameters (a single LSTM block) also works, and each side may itself be a multi-layer stack of LSTM blocks. Since it generates a sequence from a sequence, it is used not only for neural machine translation but also for neural conversational models (NCM) and text summarization.
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (arXiv, 2014/6)
- Sequence to Sequence Learning with Neural Networks (arXiv, 2014/9)
Training is performed on pairs of parallel sentences.
Let the input sentence to the encoder, the output sentence from the decoder, and the target translation (the teacher data) be given, where the final symbol is the end-of-sentence marker EOS.
First, the source words are fed one by one into the encoder's LSTM block at each step. During this phase the LSTM block emits nothing to an output layer; it merely passes its cell and hidden state on to the LSTM block at the next step. This is expressed as:
Finally, the encoder's cell and hidden state after the last input are handed to the decoder's LSTM block as the thought vector. The GO symbol is then fed into the decoder's first LSTM block, and the error between its output and the first target word becomes the loss. The input to the next step's LSTM block is the target word during training and the previous output at inference time; the error between that step's output and the corresponding target word is again added to the loss. These steps are expressed as:
Repeating this, the loss is accumulated up to the final word and EOS. The accumulated loss is back-propagated to update the parameters. Viewing the model as a conditional probability distribution, training amounts to maximizing the conditional log-likelihood:
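Below is a rough Chainer-style sketch of the training step just described: encode the source ids, hand the encoder's cell and hidden state to the decoder as the thought vector, and accumulate softmax cross-entropy loss with teacher forcing. The class layout, the use of links.LSTM, and the goID/eosID symbols are illustrative assumptions rather than the papers' reference implementation.

```python
import numpy as np
from chainer import Chain, Variable
import chainer.functions as F
import chainer.links as L

class Seq2Seq(Chain):
    def __init__(self, srcVocab, trgVocab, hiddenSize):
        super(Seq2Seq, self).__init__(
            srcEmbed = L.EmbedID(srcVocab, hiddenSize),
            encoder  = L.LSTM(hiddenSize, hiddenSize),
            trgEmbed = L.EmbedID(trgVocab, hiddenSize),
            decoder  = L.LSTM(hiddenSize, hiddenSize),
            W        = L.Linear(hiddenSize, trgVocab),
        )

    def loss(self, src_ids, trg_ids, goID, eosID):
        self.encoder.reset_state()
        self.decoder.reset_state()
        for wid in src_ids:                       # encode the source sentence
            self.encoder(self.srcEmbed(Variable(np.array([wid], dtype=np.int32))))
        # hand the thought vector (cell and hidden state) over to the decoder
        self.decoder.set_state(self.encoder.c, self.encoder.h)
        loss, prev = 0, goID
        for wid in list(trg_ids) + [eosID]:       # decode with teacher forcing
            h = self.decoder(self.trgEmbed(Variable(np.array([prev], dtype=np.int32))))
            t = Variable(np.array([wid], dtype=np.int32))
            loss += F.softmax_cross_entropy(self.W(h), t)
            prev = wid                            # at inference time: feed the prediction instead
        return loss
```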
Bucketing and Padding
TensorFlow provides bucketing and padding.
Mini-batch training requires the input and output lengths to be aligned. Padding fixes the model's input and output lengths to constant values (for example, input length 5 and output length 10) and fills the unused slots of each sentence with PAD. Incidentally, reversing the input sentence before feeding it in improves accuracy.
["I", "go", "."] => ["Je", "vais", "."] [PAD PAD "." "go" "I"] => [GO "Je" "vais" "." EOS PAD PAD PAD PAD PAD]
To avoid filling short sentences with needlessly many PAD tokens, bucketing fixes the input/output lengths to a handful of preset sizes and prepares one seq2seq model per bucket.
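A minimal sketch of padding and bucket selection along the lines described above; the PAD/GO/EOS tokens and the bucket sizes are illustrative assumptions.

```python
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]   # (input length, output length), assumed sizes

def pick_bucket(src, trg):
    # choose the smallest bucket that fits both sentences
    for in_len, out_len in buckets:
        if len(src) <= in_len and len(trg) + 2 <= out_len:   # +2 for GO and EOS
            return in_len, out_len
    raise ValueError("sentence too long for all buckets")

def pad_pair(src, trg):
    in_len, out_len = pick_bucket(src, trg)
    enc_input = ["PAD"] * (in_len - len(src)) + list(reversed(src))  # reverse the source
    dec_input = ["GO"] + list(trg) + ["EOS"]
    dec_input += ["PAD"] * (out_len - len(dec_input))
    return enc_input, dec_input

print(pad_pair(["I", "go", "."], ["Je", "vais", "."]))
# (['PAD', 'PAD', '.', 'go', 'I'],
#  ['GO', 'Je', 'vais', '.', 'EOS', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD'])
```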
See TensorFlow Tutorials - Sequence-to-Sequence Models for details.
Handling Large Vocabularies
If the entire vocabulary of the dataset is used in the input and output layers, those layers grow as the number of parallel sentences, and hence the vocabulary, grows. This inflates computation time and degrades classification accuracy, so the simple remedy is to limit the vocabulary: set an upper bound on its size and replace rare words with UNK.
When a large vocabulary is used, the bottleneck in computation time is the cost of the softmax denominator:
Several methods have therefore been proposed that approximate the softmax denominator, for example by sampling; some of them are listed below.
- Hierarchical Softmax (paper, 2005)
- Noise Contrastive Estimation (paper, 2012/2)
- Negative Sampling (arXiv, 2013/10)
- Sampled Softmax (arXiv, 2014/12)
- BlackOut (arXiv, 2015/11)
Aside
For text summarization, the major papers have come from Facebook (arXiv, 2015/9), IBM Watson (arXiv, 2016/2), and Google Brain (project page, 2016/8). For chatbots, the in-depth Japanese explainers on chatbot internals and chatbot technology are useful references. Recently the neural conversation model, which applies seq2seq to dialogue, has become prominent; the teacher data changes from translations to responses, but the model itself is identical to neural machine translation.
- A Neural Conversational Model (arXiv, 2015/6)
Skip-Thought Vectors obtains distributed representations of sentences in a word2vec-like framework from the intermediate layer produced by an encoder-decoder model.
- Skip-Thought Vectors (arXiv, 2015/6)
- Skip-Thought Vectors を解説してみた (explanatory blog post)
Attention
The upper-left figure shows the relationship between input/output length and the BLEU score (described later) for a plain encoder-decoder model; higher scores mean better translations. Translation quality starts to drop once the input/output length exceeds about 20, because it is unreasonable to ask a few-hundred-dimensional cell and hidden state to memorize a long sentence. Of course, enlarging those dimensions instead makes the weight matrices blow up and the computational cost explode. The upper-right figure shows the same relationship for attention models; RNNsearch is the attention model, and the number after it is the length of the training data. With long training sentences, the attention model's score does not fall even when translating long sentences.
[Figure: Vanilla Model (non-Attention) | Attention Model]
Attention learns a soft alignment between input and output and translates while focusing on the relevant parts of the source. By recording every intermediate state on the encoder side and letting the decoder take word alignments (e.g., 彼女 corresponds to her) and contextual information into account, it improves translation of long sentences, the weak point of the encoder-decoder model.
- Neural Machine Translation by Jointly Learning to Align and Translate (arXiv, 2014/9)
- Effective Approaches to Attention-based Neural Machine Translation (arXiv, 2015/8)
First, every intermediate state on the encoder side is recorded. The dot product of each with the decoder's current intermediate state (called the score) is computed and normalized with the softmax function to form the alignment weight vector (see the figure below).
Incidentally, the score is usually a dot product, but the paper also proposes general and concat variants like the following, with additional model parameters:
The context vector is the sum of the encoder-side intermediate states weighted by the alignment weights. It is concatenated with the decoder's current intermediate state, multiplied by a weight matrix, and passed through an activation function; the result is the final output of the intermediate layer.
In short, the current intermediate state (the "text" information) is enriched with the information of the intermediate states it should attend to (the "context" information). With this simple form of attention, the only extra parameter to learn is the final weight matrix. The description above is the type called the global attentional model.
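Here is a minimal NumPy sketch of global (dot-product) attention for a single decoder step: score every recorded encoder state against the current decoder state, normalize with softmax, form the context vector, and mix it back in with one extra weight matrix. The sizes and the name Wc are illustrative assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

hidden, src_len = 4, 6                       # toy sizes (assumptions)
rng = np.random.RandomState(0)
enc_states = rng.randn(src_len, hidden)      # recorded encoder states, one per source step
dec_state = rng.randn(hidden)                # decoder state at the current step
Wc = rng.randn(hidden, 2 * hidden)           # the single extra weight matrix to learn

score = enc_states.dot(dec_state)            # dot-product score against every encoder state
align = softmax(score)                       # alignment weight vector
context = align.dot(enc_states)              # context vector: weighted sum of encoder states
h_attn = np.tanh(Wc.dot(np.concatenate([context, dec_state])))  # final attentional state
print(h_attn.shape)
```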
[Figure: Global Attentional Model | Local Attentional Model]
The local attentional model additionally introduces an aligned position, predicted from the decoder state with extra model parameters and bounded by the length of the input sentence.
To favor the aligned position, a Gaussian centered on it is placed over the source positions and the alignment weight vector is formed as follows:
The rest of the procedure is the same as in the global attentional model.
For a detailed explanation, see Yuta Kikuchi's survey of attention in the recent deep learning (NLP) community (最近のDeep Learning (NLP) 界隈におけるAttention事情).
For a more visual understanding, Attention and Augmented Recurrent Neural Networks is recommended.
Bidirectional Encoders and Stacked LSTMs
A bidirectional encoder concatenates the intermediate states of the ordinary encoder (an RNN) with those of an encoder (RNN) that reads the input sentence in reverse. This adds information about future words to the encoder's analysis.
The backward RNN and the forward RNN are expressed by the following equations, in which the recurrent unit may be an LSTM block or a GRU block:
The concatenation is expressed as:
In Google Translate's bidirectional encoder, each direction is given half the size so that the concatenation comes back to the original size.
The stacked LSTM is a model deepened by piling LSTM blocks on top of each other. Like the multi-layer perceptron used on MNIST, each layer can represent information at a different granularity. The seq2seq paper cited earlier uses a 4-layer stacked LSTM, and the architecture also appears in the TensorFlow and Keras tutorials.
[Figure: TensorFlow tutorial | Keras tutorial]
Evaluating Language Models: Perplexity
There are several evaluation measures for language models; likelihood, log-likelihood, entropy, and perplexity are the usual ones.
The likelihood is the probability of the test data given the model, as follows:
exp blows up quickly as the exponent grows; math.exp(710.0), for example, raises a math range error. The likelihood likewise explodes or vanishes numerically, so the log-likelihood is used instead.
The entropy is the negative base-2 log-likelihood divided by the number of words, as follows; the larger the entropy, the harder (more uncertain) the prediction of the next word.
Perplexity is 2 raised to the entropy; the smaller it is, the better the model.
For example, to evaluate an RNNLM, compute the probability the model assigns to each next word and take the perplexity. The RNNLM's output at each step is a probability distribution over every word in the dictionary, and the probability of a particular word is simply the corresponding component of that distribution.
Computing perplexity from the Chainer LSTM implementation above goes as follows:
```python
import math

def PPL(model, sentence):
    total = 0.0
    # omitted: reinitialize model.cell and model.hidden as Variables (see reset() above)
    for i in range(len(sentence) - 1):
        pred, target = sentence[i], sentence[i + 1]
        embed = model.U(Variable(np.array([pred], dtype=np.int32)))
        # omitted: one forward step as in forward() above, updating model.cell and model.hidden
        y = F.softmax(model.V(model.hidden))
        p = y.data[0][target]
        total -= math.log(p, 2)     # math.log(x, y) is the logarithm of x in base y
    return total

f = 0.0
w = 0
for sentence in testData:
    f += PPL(model, sentence)
    w += len(sentence)
ppl = math.pow(2, f / w)            # math.pow(x, y) is x to the power y
```
Incidentally, for explanations of language models, Graham Neubig's lectures on 1-gram and n-gram language models and the Matsumoto lab's notes on language models are good; the latter in particular recasts entropy minimization as minimizing the Kullback-Leibler divergence.
Evaluating Machine Translation: BLEU
Besides perplexity, the standard evaluation measure for machine translation is BLEU, a score computed from n-gram precision; NMT papers almost always report BLEU. BLEU gives high marks to sentences that are locally fluent or match the wording of the reference, but it correlates poorly with semantic adequacy.
- BLEU: a Method for Automatic Evaluation of Machine Translation (paper, 2002/7)
- 文レベルの機械翻訳評価尺度に関する調査 (a survey of sentence-level MT evaluation measures; paper, 2013)
- 自動評価 (explanatory site)
Let the model's output sentence and the reference (correct) sentence be given. The n-gram precision is then defined by the following equation. (The hyphen in n-gram inexplicably breaks MathJax...)
In practice n runs up to 4 (4-gram).
Since an n-gram is simply a word sequence of length n, it can be written accordingly.
Counting how many n-grams of the output also appear in the reference, the n-gram match count is given by the following equation, clipped by the maximum number of times the n-gram occurs in the reference:
The Brevity Penalty (BP) penalizes translations whose output is shorter than the reference; it prevents gaming the n-gram precision with very short outputs.
BLEU is then defined by the following equation. Its value ranges from 0 to 1, and larger scores mean better translations; it is often reported as a percentage, in which case the range is 0 to 100.
Its logarithm is as follows,
where
the largest value satisfying the stated condition is used; the original paper's baseline adopts this setting.
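As a sketch of the definition above, the following computes a sentence-level BLEU with clipped n-gram precision, uniform weights, and the brevity penalty; the tiny smoothing floor and the example sentences are assumptions added so the toy call does not take the log of zero.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    # Clipped (modified) n-gram precision, geometric mean with uniform weights,
    # multiplied by the brevity penalty.
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref  = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        match = sum(min(c, ref[g]) for g, c in cand.items())     # clipped matches
        total = max(sum(cand.values()), 1)
        log_prec += math.log(max(match, 1e-9) / total) / max_n   # uniform weights 1/N
    bp = 1.0 if len(candidate) > len(reference) else \
         math.exp(1.0 - len(reference) / len(candidate))         # brevity penalty
    return bp * math.exp(log_prec)

cand = "the cat is on the mat".split()
ref  = "there is a cat on the mat".split()
print(round(bleu(cand, ref, max_n=2), 3))   # ~0.49 with bigram precision only (default is 4)
```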
The Latest in Neural Machine Translation, Tracked on arXiv
If it is not flashy, it is not machine learning. Neural machine translation is all about firepower!
In that spirit, here are a few methods from the recent past (September 2016 onward) that struck me as flashy.
1: Google Neural Machine Translation (GNMT)
- Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (arXiv, 2016/9)
- Peeking into the neural network architecture used for Google's Neural Machine Translation (explanatory blog post)
- G社のNMT論文を読んでみた (SlideShare)
Google's neural machine translation (GNMT) stacks the Enc and Dec of the Enc-Dec vertically into a deep model with residual connections in between; a commentator called it a "monster" and I can only agree. https://t.co/jSJyOQqq8e pic.twitter.com/8NDCXnO1qX
— Ryobot | りょぼっと (@_Ryobot) November 18, 2016
The base is an attention-equipped encoder-decoder model plus a bidirectional encoder. Its flashy feature is an 8-layer stacked LSTM; with naive stacking, 4 layers works best, and at 8 layers gradients vanish or explode, so residual connections (explicit addition of the input; the Sum layer in the tweet's figure) are introduced. Attention is applied over the layers of the deep encoder.
Training uses asynchronous (Async) data parallelism (12-way) together with model parallelism that assigns one GPU per layer (8-way, a pipeline); the softmax layer is also split across GPUs by vocabulary. Async means each worker receives the model from a parameter server and, as soon as its gradient computation finishes, individually sends its gradients back so the model is updated asynchronously. Google seems fond of Async: DistBelief, the predecessor of TensorFlow, is also Async.
- For speed, certain internal activations are clipped (capped at a threshold), as is the input to the softmax; whether gradient clipping against explosions is also used is unclear.
- For speed, the weight matrices inside the LSTM are quantized to 8-bit integers (perhaps the μ-law algorithm used in WaveNet?).
- The hardware is 96 Nvidia Tesla K80s.
- The datasets are WMT En→Fr (36M sentence pairs), WMT En→De (5M sentence pairs), and Google-internal data (size undisclosed).
2: Multilingual GNMT with Zero-Shot Translation
- Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation (arXiv, 2016/11)
This is a sequel to GNMT and the method that began to be used in the much-discussed Google Translate in November. Ordinary machine translation trains one model per language pair, but the multilingual GNMT shares the model parameters and trains a single model across many languages, apparently acquiring universal translation knowledge. By prepending a target-language token to the input sentence (e.g., <2es> to translate into Spanish), it can even translate between language pairs it was never trained on (zero-shot translation). As usual, training took 100 GPUs for 3 weeks. Machine learning is all about power!
life after google brain will require some adjustment ... pic.twitter.com/CZE0kNmqKC
— hardmaru (@hardmaru) October 5, 2016
3: Neural Machine Translation with Character-Level Convolutions
The base is again an attention-equipped encoder-decoder model, but the encoder part (figure below) is special, inspired by Character-Aware Neural Language Models (arXiv, 2015/8), which builds a language model with a CNN-like procedure.
First, the whole input sentence is converted into character-level distributed representations and the gaps are filled with padding; multi-channel kernels are then slid across it to perform the convolution. Next, max pooling over equally spaced windows produces the segment embeddings, which are fed into a 4-layer Highway Network (arXiv, 2015/5). A highway network transforms its input into its output through gates of the following form:
The result is then fed into a bidirectional GRU, whose output is treated as the encoder's intermediate states.
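A minimal NumPy sketch of one highway layer as used in this encoder: a transform gate T blends the nonlinear transform H with the untouched input. The dimension, the random weights, and the negative gate bias (which initially favors carrying the input through) are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(x, Wh, bh, Wt, bt):
    # Transform gate T decides how much of the nonlinear transform H passes through;
    # the remainder (1 - T) carries the input x unchanged.
    H = np.tanh(Wh.dot(x) + bh)
    T = sigmoid(Wt.dot(x) + bt)
    return T * H + (1.0 - T) * x

dim = 8                                   # toy size (assumption)
rng = np.random.RandomState(0)
y = rng.randn(dim)
for _ in range(4):                        # 4 stacked highway layers, as in the text
    y = highway_layer(y, rng.randn(dim, dim), np.zeros(dim),
                         rng.randn(dim, dim), np.full(dim, -2.0))  # negative bias favors carry
print(y.shape)
```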
There is earlier work on character-level convolutions; for details see 自然言語処理における畳み込みニューラルネットワークを理解する.
4: ByteNet (Dilated Convolutions)
- Neural Machine Translation in Linear Time (arXiv, 2016/10)
From DeepMind. The model uses dilated CNNs for both the encoder (the source network in the lower half of the figure) and the decoder (the target network in the upper half), and is SOTA as a character-level language model. The network connections are shifted along the sequence to generate the sequence data (or the intermediate states). It also follows the usual NMT recipe: during training the decoder is fed the ground-truth tokens, and at test time the previous step's outputs; the encoder's intermediate states are kept and used.
Dilated convolutions are best known from WaveNet (arXiv, 2016/9), a kind of autoregressive model, and DeepMind likes to use them for convolving sequence data.
To keep the computational cost of dilated convolutions under control, WaveNet makes good use of residual blocks; musyoku points this out in his explanatory blog:
"Increasing the number of convolution layers inside a block and enlarging the dilation makes the receptive field grow exponentially, while stacking blocks makes it grow linearly."
Even so, dilated convolutions remain expensive, so it is desirable to speed them up by caching and reusing previously computed neuron outputs, as Fast Wavenet does.
5: Neural Machine Translation with Reconstruction (Re-Translation)
The idea is that if you can re-translate, say, Japanese → English → Japanese and recover the original sentence, you probably have a good translation model. The paper takes an attention-equipped encoder-decoder model as the base and reconstructs the encoder's hidden states from the decoder's hidden states.
First, weights are computed from the decoder's hidden states (the same procedure as deriving the alignment weight vector from the encoder's intermediate states in ordinary attention), and the inverse context vector is obtained as:
The reconstruction of the hidden states is expressed by the following equation, with an activation function applied:
The reconstruction distribution is therefore as follows, using the softmax function:
The model is trained with an objective that maximizes the sum of the target-language likelihood of the translation (encoder-decoder) and the source-language likelihood of the reconstruction (reconstructor), as in the following equation, with a hyperparameter balancing translation against reconstruction.
6: GCNN, Stacking Convolutions and Gated Linear Units (GLU)
This is a flashy CNN-family model, faster to compute than the RNN family, that achieves SOTA perplexity on WikiText-103 and the best single-GPU score on the Google Billion Word benchmark. The convolution-plus-gating blocks in the figure below are stacked L layers deep (8 and 13 layers in the experiments), and each block is a bottleneck residual block (each with up to 5 layers) whose input is added to its output.
Letting the vocabulary size, the embedding size, and the word-embedding table be given, each hidden layer is expressed by the following equation, where its input is the word embeddings or the previous hidden layer's output, the remaining symbols are learned parameters, the sigmoid function, and the element-wise (Hadamard) product of matrices. The kernels shift the input to the convolution so that future words are never referenced, and the front of the sequence is zero-padded by one less than the kernel width.
The output of each layer is thus a linear map whose information flow is controlled by the gate; this gate is called the Gated Linear Unit (GLU).
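A minimal NumPy sketch of the Gated Linear Unit: one linear map gated element-wise by the sigmoid of another. For brevity the causal convolution over the sequence is replaced here by a plain position-wise linear map, which is an assumption of the sketch, not the GCNN itself.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def glu_layer(X, W, b, V, c):
    # Gated Linear Unit: a linear map gated element-wise by the sigmoid of a second linear map.
    return (X.dot(W) + b) * sigmoid(X.dot(V) + c)

rng = np.random.RandomState(0)
seq_len, d_in, d_out = 10, 16, 16            # toy sizes (assumptions)
X = rng.randn(seq_len, d_in)
Y = glu_layer(X, rng.randn(d_in, d_out), np.zeros(d_out),
                 rng.randn(d_in, d_out), np.zeros(d_out))
print(Y.shape)   # (10, 16); in the GCNN both linear maps are causal convolutions over the sequence
```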
The experimental results are roughly as follows:
- PPL on WikiText-103: GCNN-8 (44.9) beats LSTM-1024 (48.7)
- Tokens processed per second on a GPU: GCNN-22 (45,878) vastly exceeds LSTM-2048 (2,282)
- The deeper the network and the longer the context, the lower the PPL, monotonically
- Weight normalization alone << gradient clipping alone < weight normalization plus gradient clipping
RNNs are in decline! On a single GPU, GCNN beats the SOTA LSTM on perplexity! The Gated CNN stacks 13 GLU-style gated convolution layers as in the equation. Since the Dilated CNN and the QRNN it has been looking like a total victory for CNNs. pic.twitter.com/uQ6onBE9W8
— Ryobot | りょぼっと (@_Ryobot) December 26, 2016
"RNNs are in decline" is a joke. Half a joke, anyway. Networks that consume sequence data step by step parallelize poorly and compute slowly, which slows down the improve-and-experiment cycle. Unfortunately, the RNN family gains little from GPUs even with mini-batch training (multi-sequence parallelism), and for online learning a CPU is actually faster.
7: Selecting Expert Subnetworks with Sparse Gates
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (OpenReview, 2016/12)
From Google Brain, with Geoffrey Hinton and Jeff Dean among the last authors. It proposes the Mixture-of-Experts layer (MoE), which uses gating to select a few out of thousands of feed-forward subnetworks and connect them sparsely. In the experiments MoE layers are inserted between stacked LSTM layers, and on large-scale language modeling and machine translation tasks the model beats SOTA models (e.g., Google's neural machine translation) on both perplexity and BLEU score.