This article introduces COMA, an early algorithm in multi-agent deep reinforcement learning. The original paper is Foerster et al., Counterfactual Multi-Agent Policy Gradients, AAAI, 2018.
Introduction
We first review TD learning and policy gradient methods.
TD learning
Reinforcement learning usually assumes a Markov decision process. Under that assumption, we want to know the value function $V(s)$ of a state $s$. Each episode yields a trajectory (history) $$ \{ (s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_{T - 1}, a_{T - 1}, r_{T - 1}) \} $$ and we use these samples to train the value function.
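To make the idea of learning a value function from a trajectory concrete, here is a minimal sketch of a tabular TD(0) update. The function name `td0_update`, the step size `alpha`, the discount `gamma`, and the toy episode are all assumptions made for illustration; they do not come from the original article or paper.

```python
from collections import defaultdict

def td0_update(values, trajectory, alpha=0.1, gamma=1.0):
    """Apply one pass of tabular TD(0) updates along a single trajectory.

    trajectory is a list of (state, action, reward) tuples, as in the text.
    """
    for t, (state, _action, reward) in enumerate(trajectory):
        if t + 1 < len(trajectory):
            next_value = values[trajectory[t + 1][0]]
        else:
            next_value = 0.0  # assumption: the episode ends after the last step
        td_error = reward + gamma * next_value - values[state]
        values[state] += alpha * td_error
    return values

# Hypothetical two-step episode that is replayed many times.
values = defaultdict(float)
episode = [("s0", "right", 0.0), ("s1", "right", 1.0)]
for _ in range(200):
    td0_update(values, episode, alpha=0.1)
print(dict(values))
```

With `gamma=1.0` both state values approach the remaining return of 1, which is what the value function should report for this toy episode.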
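Policy gradient methods
Policy gradient methods are mostly used in actor-critic approaches. The actor represents the policy ${\pi}_{{\theta}}( {a} | {s})$ and outputs which action $a$ to take in a given state. The critic estimates the value function $V(s)$ or the action-value function $Q({s}, {a})$. The key result is the policy gradient theorem below.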
Policy gradient theorem (informal)
Let $J(\theta)$ be the expected cumulative reward under the policy $\pi_{\theta}(a|s)$. Then the following holds.
$$
\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}} \left[ \sum_{t = 0}^{T - 1} \nabla_{\theta} \log \pi_{\theta}(a_{t} | s_{t}) \left( Q^{\pi_{\theta}}(s_{t}, a_{t}) - b(s_{t}) \right) \right].
$$
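A detailed explanation can be found in the book Reinforcement Learning (強化学習) by Tetsuro Morimura (Kodansha). Here $b(s)$ is called the baseline function and depends only on the state $s$. The choice of baseline determines how large the variance of the gradient estimate is, and many different baselines have been studied.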
One method built on the policy gradient is REINFORCE (Williams, 1992). Every time an episode history $(s_0, a_0, r_0), (s_1, a_1, r_1), \ldots, (s_{T - 1}, a_{T - 1}, r_{T - 1})$ is collected, we compute $$ c_{t} := \sum_{l = t}^{T - 1}r_{l}, \quad \forall t \in \{ 0, 1, \ldots, T - 1 \} $$ and update the parameters as follows: $$ \theta \leftarrow \theta + \alpha \frac{1}{T} \sum_{t = 0}^{T - 1} (c_{t} - b(s_{t}))\nabla_{\theta} \log \pi_{\theta}(a_{t} | s_{t}) $$ Note that $Q$ is estimated here by Monte Carlo sampling, so REINFORCE is not an actor-critic method but a policy gradient method without a critic.
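The return $c_t$ and the per-step weight that multiplies $\nabla_{\theta} \log \pi_{\theta}(a_t | s_t)$ can be sketched in a few lines. The helper names `returns_to_go` and `reinforce_weights` and the sample rewards are made up for illustration, and the step size $\alpha$ is left out because it only rescales the result.

```python
def returns_to_go(rewards):
    """Compute c_t = sum_{l=t}^{T-1} r_l for every t with one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns

def reinforce_weights(rewards, baselines):
    """Per-step weights (c_t - b(s_t)) / T applied to grad log pi(a_t | s_t)."""
    horizon = len(rewards)
    return [(c - b) / horizon
            for c, b in zip(returns_to_go(rewards), baselines)]

rewards = [1.0, 0.0, 2.0]          # hypothetical episode rewards
print(returns_to_go(rewards))      # [3.0, 2.0, 2.0]
print(reinforce_weights(rewards, baselines=[0.0, 0.0, 0.0]))
```

In an actual implementation these weights would be multiplied with the log-probabilities produced by the policy network before backpropagation.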
For actor-critic methods we set $b(s) = V(s)$ and use the advantage function $ A(a_{t}, s_{t}) = r_{t} + V(s_{t + 1}) - V(s_{t}) $ to update the parameters as follows. $$ \theta \leftarrow \theta + \alpha \frac{1}{T} \sum_{t = 0}^{T - 1} \nabla_{\theta}\log \pi_{\theta}(a_{t}|s_{t}) A(a_{t}, s_{t}) $$
The $V(s)$ used here is the estimate produced by the critic. (In the off-policy case, the advantage function is taken to be $A(a_{t}, s_{t}) = r_{t} + \max_{a \in \mathcal{A}} Q^{\pi_{\theta}}(s_{t + 1}, a)$.)
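As a small worked example of the on-policy advantage $A(a_t, s_t) = r_t + V(s_{t+1}) - V(s_t)$, the sketch below computes it for a whole episode from critic estimates. The function name `one_step_advantages`, the sample numbers, and the convention that the value after the final step is zero are assumptions for illustration.

```python
def one_step_advantages(rewards, state_values):
    """A_t = r_t + V(s_{t+1}) - V(s_t) for each step of one episode.

    state_values[t] is the critic's estimate V(s_t).
    Assumption: the value after the final step is 0 (the episode terminates).
    """
    advantages = []
    for t, reward in enumerate(rewards):
        next_value = state_values[t + 1] if t + 1 < len(state_values) else 0.0
        advantages.append(reward + next_value - state_values[t])
    return advantages

# Hypothetical rewards and critic estimates.
print(one_step_advantages([1.0, 0.0, 2.0], [2.5, 2.0, 1.0]))  # [0.5, -1.0, 1.0]
```

A positive entry means the step turned out better than the critic expected, so the update raises the probability of the action taken at that step.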
Main topic
Consider $N$ agents. A naive approach is to give every agent $i\in \{ 1, \ldots, N \}$ the same form of policy gradient, $$ G = \nabla_{\theta}\log \pi_{\theta}(a^{i}_{t} | s^{i}_t) \left( Q(s_{t}, a_{t}) - V\left( s_t \right) \right) $$ where $a^{i}_t$ and $s^{i}_t$ are agent $i$'s own action and state, $s_{t}$ and $a_t$ are the joint state and joint action of all agents, and $r_t$ is the reward shared by all agents. With this choice it is hard to infer how much an individual agent's action contributed to the overall reward, which is known as the credit assignment problem. While the other agents are still exploring their own policies, $G$ becomes noisy, and an agent may fail to improve its policy.
Proposed method
The starting point is that the advantage function has to be designed carefully to stabilize learning, so COMA changes the baseline function. Intuitively, we want to know "how good is my (agent $i$'s) current policy if the other agents keep their actions as they are?" COMA captures this intuition with the following advantage function. $$ A^{i}(s, a) = Q(s, a) - \sum_{u_{i} \in \mathcal{A}} \pi_{\theta} (u_{i} | H_{i} ) Q(s, (\mathbf{u}^{- i}, u_{i})) $$ Here $u_{i}$ is an action of agent $i$, $\mathbf{u}^{-i}$ is the vector of actions of all agents other than $i$, held fixed, and $H_{i}$ is the action-observation history of agent $i$.
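The counterfactual advantage can be illustrated with plain numbers. The sketch below assumes the critic has already been evaluated for every candidate action of agent $i$ while the other agents' actions stay fixed; the function name `coma_advantage` and the sample values are invented for illustration and are not taken from the paper's code.

```python
def coma_advantage(q_values, policy_probs, taken_action):
    """Counterfactual advantage for one agent.

    q_values[u]     : Q(s, (u^{-i}, u)) for each candidate action u of agent i,
                      with the other agents' actions held fixed.
    policy_probs[u] : pi_theta(u | H_i) for agent i.
    taken_action    : index of the action agent i actually took.
    """
    # Baseline: expected Q under agent i's own policy (others unchanged).
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    return q_values[taken_action] - baseline

# Hypothetical critic outputs and policy for an agent with three actions.
q_values = [1.0, 3.0, 2.0]
policy_probs = [0.5, 0.25, 0.25]
print(coma_advantage(q_values, policy_probs, taken_action=1))  # 1.25
```

Because the baseline is the policy-weighted average of the same Q-values, the advantage averaged over agent $i$'s own policy is zero, so the baseline does not bias the gradient.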
Experiments
Finally, let us actually run the COMA algorithm. The code is available here.
Experimental environment
The environment is as follows. As shown in the figure, four agents (purple, blue, green, and orange) each try to move to the cell painted in their own color. Concretely, purple, blue, green, and orange start from coordinates (0, 0), (0, 5), (6, 0), and (5, 6) and try to reach the diagonally opposite coordinates (5, 6), (5, 0), (0, 6), and (0, 0), respectively.
At each time step an agent can either stay in its cell or move to the cell to its left, right, top, or bottom. Agents cannot move into cells painted black. Actions are encoded as integers: 1 moves up, 2 moves right, 3 moves down, 4 moves left, and 0 stays in place. The reward at each time step is defined from the squared Euclidean distance of every agent to its goal, that is, $$ \sum_{i = 1}^{4} \left[ ( x_{i} - x^{goal}_{i} ) ^2 + (y_{i} - y^{goal}_{i}) ^{2} \right]. $$
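Model
- Actor: takes the two-dimensional coordinates as input, remembers up to 5 steps, and outputs an action with a GRU.
- Critic: treats the whole board as a 9-channel image. Channels 1 to 4 encode the position of each agent, and channels 5 onward encode the action of each agent. A CNN maps this input to the value of the action-value function.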
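Results
To state the conclusion first, the results are underwhelming. Sample efficiency is not very good, and the variance during training is large.
When things go well, all 4 agents reach their respective goals (see the figure below), but in most runs this does not happen, and some agent ends up staying in the same place the whole time.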
Thoughts
COMA can no longer be called state of the art in multi-agent deep reinforcement learning. Still, the "counterfactual" idea is interesting. I think people also often ask themselves, "If I had acted differently, how would the performance of the whole team have changed?"
This blog is the tech blog of EfficiNet X Inc.