Skimming the paper wasn't enough to really understand it, so here are some notes.
Preliminaries (AlphaGo)
- policy network: a neural network that takes the board position and its features as input and returns the probability of playing each point.
- value network: a neural network that takes the board position and its features as input and returns the win rate for that position.
In AlphaGo, the policy network is first pretrained with supervised learning on professional game records, and then further improved by reinforcement learning through self-play.
AlphaGo's reinforcement learning part
- Training starts from the policy network parameters $\rho_0$ obtained by supervised learning. The policy network parameters are updated step by step from the results of self-play; call them $\rho_1, \rho_2, \cdots$. In the $t$-th self-play game, the current parameters $\rho_t$ play against parameters $\rho_{t'}$ from some earlier iteration.
- In practice, saving the parameters after every single game would produce an unmanageable amount of data, so the parameters are saved only once every fixed number of iterations.
- The parameters are updated with the REINFORCE algorithm. The update $\Delta\rho$ is given by the formula below, where $\alpha$ is the learning rate, $T$ is the number of steps until the self-play game ends, $a^i$ and $s^i$ are the action taken and the state at step $i$, $p_\rho(\cdot \mid \cdot)$ is the output of the policy network parameterized by $\rho$, $z$ is the reward, $+1$ if the agent wins the game and $-1$ if it loses, and $b^i$ is a quantity called the baseline, used to reduce variance.
\[
\Delta\rho = \alpha \sum_{i=1}^{T} \frac{\partial\log{p_\rho(a^i \mid s^i)}}{\partial\rho}(z - b^i)
\]
- Moves during self-play are sampled directly from the policy network's probabilities: $a^i \sim p_\rho(\cdot \mid s^i)$. In other words, no search is performed during self-play (a code sketch of this sampling step and the update above follows this list).
- Once the policy network has been trained with REINFORCE, the value network is trained on the self-play game records, in the manner of supervised learning.
- At test time, these policy and value networks are used as evaluation functions during the MCTS search.
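To make the update above concrete, here is a minimal sketch assuming a toy softmax policy that is linear in a feature vector of the position; the sizes, the linear parameterization, and the zero baseline are made up for illustration (the real policy network is a deep convolutional net operating on the board):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_move(rho, s):
    """Self-play move selection: sample directly from the policy, no search."""
    return np.random.choice(rho.shape[0], p=softmax(rho @ s))

def grad_log_policy(rho, s, a):
    """d/d_rho of log p_rho(a | s) for the toy policy p = softmax(rho @ s)."""
    p = softmax(rho @ s)
    one_hot = np.zeros_like(p)
    one_hot[a] = 1.0
    return np.outer(one_hot - p, s)            # same shape as rho

def reinforce_update(rho, states, actions, z, baselines, alpha=0.01):
    """Delta rho = alpha * sum_i  d log p_rho(a^i | s^i) / d rho  * (z - b^i)."""
    delta = np.zeros_like(rho)
    for s, a, b in zip(states, actions, baselines):
        delta += grad_log_policy(rho, s, a) * (z - b)
    return rho + alpha * delta

# One made-up "game": 3 positions, the agent wins (z = +1), zero baseline.
rng = np.random.default_rng(0)
n_moves, n_features = 5, 8
rho = rng.normal(size=(n_moves, n_features))
states = [rng.normal(size=n_features) for _ in range(3)]
actions = [sample_move(rho, s) for s in states]
rho = reinforce_update(rho, states, actions, z=+1.0, baselines=[0.0, 0.0, 0.0])
```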
How Zero is trained
Next, Zero. One difference related to the self-play part: in AlphaGo the policy network and the value network were separate, but in Zero they are unified into a single network, with corresponding changes to the network architecture. So whereas in AlphaGo the policy network and the value network were trained in separate stages, in Zero they are optimized within a single training loop.
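As a rough picture of that unified network, here is a minimal sketch of a shared trunk feeding a policy head and a value head; all sizes and the `params` layout are made up for illustration (the real network is a deep residual CNN):

```python
import numpy as np

def forward(x, params):
    """Two-headed network: one forward pass returns (p, v)."""
    h = np.tanh(params["W_trunk"] @ x)         # shared representation
    logits = params["W_policy"] @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                               # move probabilities
    v = np.tanh(params["w_value"] @ h)         # scalar value in (-1, 1)
    return p, v

rng = np.random.default_rng(0)
params = {
    "W_trunk": rng.normal(size=(16, 8)),       # 8 input features -> 16 hidden units
    "W_policy": rng.normal(size=(5, 16)),      # 5 candidate moves
    "w_value": rng.normal(size=16),
}
p, v = forward(rng.normal(size=8), params)
```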
Zero also performs self-play, but it differs from AlphaGo in points like the following.
- MCTS search is performed during self-play as well. This yields a vector $\pi$ of the probabilities, according to MCTS, of playing each point.
- The update rule for the parameters is also different. Write the neural network's output as $(p, v)$. The parameters are optimized with a loss built from the following terms (see the sketch after this list):
    - Loss for $v$: $(v-z)^2$
    - Loss for $p$: $-\pi^\mathrm{T} \log{p}$ (a cross-entropy; the intent is that the network's output should match the MCTS output as closely as possible)
    - Plus regularization of the network parameters: $\|\theta\|^2$
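Putting the three terms together, here is a minimal per-position loss sketch; treating all parameters as one flat vector `theta`, and the value of the regularization weight `c`, are simplifications for illustration:

```python
import numpy as np

def zero_loss(p, v, pi, z, theta, c=1e-4):
    """(v - z)^2  -  pi^T log p  +  c * ||theta||^2  for a single position."""
    value_loss = (v - z) ** 2
    policy_loss = -np.dot(pi, np.log(p))       # cross-entropy against the MCTS probabilities
    reg = c * np.dot(theta, theta)
    return value_loss + policy_loss + reg
```

Gradients of this combined loss are what drive the single training loop mentioned above, with $\pi$ coming from the MCTS search and $z$ from the final outcome of the game.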
Discussion
I'm not entirely confident, but is the following understanding roughly right?
- In AlphaGo, the moves played during self-play came from sampling, so the variance of the updates seems like it would be large; in Zero, $p$ is trained on the search results used directly as something like supervised targets, so the variance should be smaller. That should mean fewer samples are needed to make stable parameter updates.
- For learning $v$ as well, the search results are higher-quality judgments than raw samples from the policy, so there seems to be the benefit that reliable information can be used as training targets from an early stage.
- In AlphaGo, the training of the policy network and the search algorithm were kept separate (in terms of the research lineage too, the approach feels like it was: MCTS is the fundamental algorithm, and the goal is to build the highest-quality evaluation functions possible for it), whereas in Zero the search algorithm enters directly into the training process. This makes it possible to follow a policy at training time that is more consistent with what is done at test time.
Aside
You sometimes see it said that "Zero was trained using only 4 TPUs", but that refers to test time; for training, the paper states that 64 GPU workers and 19 CPU parameter servers were used: """Each neural network fθi is optimized on the Google Cloud using TensorFlow, with 64 GPU workers and 19 CPU parameter servers."""