The other day I tried Action Value Gradient (AVG) and confirmed that it runs on my own machine.
It is online reinforcement learning with a batch size of 1 and no replay buffer, and I was able to confirm that the reward on Humanoid-v5 keeps improving.
On the other hand, it takes about 2M steps to reach reasonable performance, so the sample efficiency feels somewhat poor. The original paper also touches on this under Limitations and Future Work in "7 Conclusion", which made me think eligibility traces might be one remedy, so this time I gave them a try.
Implementation
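Eligibility traces implement TD(λ) in the backward view. The trace vector $\mathbf{z}_t$ is built up as follows.
$$ \mathbf{z}_{-1} = \mathbf{0}, \qquad \mathbf{z}_t = \gamma \lambda \mathbf{z}_{t-1} + \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}_t) $$
Then, with the one-step TD error
$$ \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t) $$
the update becomes the following, where $\alpha$ is the step size.
$$ \mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t $$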
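To make the backward-view update concrete, here is a minimal sketch for a linear value function in plain Python. It is not part of the AVG code: the function name `td_lambda_episode` and its arguments are made up for illustration, and I assume the trace is reset at the start of each episode and the terminal state has value 0.

```python
def td_lambda_episode(features, rewards, w, alpha=0.1, gamma=0.99, lam=0.8):
    """Run one episode of backward-view TD(lambda) for a linear value function.

    features: feature vectors x_0 .. x_T (x_T belongs to the terminal state)
    rewards:  rewards r_1 .. r_T, one per transition
    w:        initial weights; an updated copy is returned
    """
    w = list(w)
    z = [0.0] * len(w)  # eligibility trace, reset at the start of the episode
    for t, reward in enumerate(rewards):
        x, x_next = features[t], features[t + 1]
        terminal = t == len(rewards) - 1
        v = sum(wi * xi for wi, xi in zip(w, x))
        v_next = 0.0 if terminal else sum(wi * xi for wi, xi in zip(w, x_next))
        delta = reward + gamma * v_next - v  # one-step TD error
        # For a linear value function the gradient of v w.r.t. w is just x.
        z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
        w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w


# Two-step chain with one-hot features; the reward arrives only at the end.
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_td0 = td_lambda_episode(feats, [0.0, 1.0], [0.0, 0.0], alpha=0.5, gamma=1.0, lam=0.0)
w_td1 = td_lambda_episode(feats, [0.0, 1.0], [0.0, 0.0], alpha=0.5, gamma=1.0, lam=1.0)
```

With λ = 0 only the state visited just before the reward is updated, while with λ = 1 the trace also carries credit back to the first state. That is the effect I was hoping would help sample efficiency.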
I have not thought this through rigorously, but trusting that the same thing holds for action values, a direct implementation should look like the following. (The full code is here; the excerpt below shows the main parts.)

    class AVG:
        def __init__(self, cfg: argparse.Namespace) -> None:
            ...
            # Prepare the trace vectors in the constructor
            with torch.no_grad():
                self.eligibility_traces_q = [
                    torch.zeros_like(p, requires_grad=False) for p in self.Q.parameters()
                ]

        def update(...) -> None:
            ...
            q = self.Q(obs, action.detach())  # N.B: Gradient should NOT pass through action here
            with torch.no_grad():
                next_action, action_info = self.actor(next_obs)
                next_lprob = action_info["lprob"]
                q2 = self.Q(next_obs, next_action)
                target_V = q2 - self.alpha_lr * next_lprob

            reward = self.symlog(reward)
            delta = reward + (1 - done) * self.gamma * target_V - q
            ...
            self.qopt.zero_grad()
            if self.use_eligibility_trace:
                # Backpropagate q itself so that p.grad holds the gradient of q
                q.backward()
                with torch.no_grad():
                    for p, et in zip(self.Q.parameters(), self.eligibility_traces_q):
                        # Decay the trace and accumulate the gradient of q
                        et.mul_(self.et_lambda * self.gamma).add_(p.grad.data)
                        # Overwrite grad so that qopt.step() applies the trace update
                        p.grad.data = -2.0 * delta * et
            else:
                qloss = delta**2
                qloss.backward()
            self.qopt.step()

        def reset_eligibility_traces(self) -> None:
            for et in self.eligibility_traces_q:
                et.zero_()

A flag, `use_eligibility_trace`, switches the traces on and off. Regardless of whether `use_eligibility_trace` is on or off, I wanted the parameters to be updated through the Optimizer's `step()`, so I implemented this by overwriting grad directly.
In theory, TD(λ) with $\lambda = 0$ coincides with one-step TD. For this implementation, I confirmed that `use_eligibility_trace=True` with `self.et_lambda=0` is numerically identical to `use_eligibility_trace=False` (with the same seed, the network parameters match exactly after one update).
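The reason the two paths agree at λ = 0 can be checked on a toy linear Q in plain Python. This is only a sketch under my own assumptions: `grad_squared_td_error` and `grad_via_trace` are hypothetical helper names, the target is treated as a constant (as under `torch.no_grad()` in the excerpt), and no PyTorch is involved.

```python
def grad_squared_td_error(w, x, target):
    """Gradient of delta**2 w.r.t. w for q = w . x, with the target held constant."""
    q = sum(wi * xi for wi, xi in zip(w, x))
    delta = target - q
    return [-2.0 * delta * xi for xi in x]


def grad_via_trace(w, x, target, trace, gamma, lam):
    """Decay the trace, add grad q (= x for a linear q), and return -2 * delta * trace."""
    q = sum(wi * xi for wi, xi in zip(w, x))
    delta = target - q
    new_trace = [gamma * lam * ei + xi for ei, xi in zip(trace, x)]
    return [-2.0 * delta * ei for ei in new_trace], new_trace


w, x, target = [0.5, -1.0], [1.0, 2.0], 1.0
old_trace = [3.0, 4.0]  # leftover trace from earlier steps

g_loss = grad_squared_td_error(w, x, target)
g_lam0, _ = grad_via_trace(w, x, target, old_trace, gamma=0.99, lam=0.0)
g_lam05, _ = grad_via_trace(w, x, target, old_trace, gamma=0.99, lam=0.5)
```

With `lam=0.0` the old trace is multiplied by zero, so the overwritten gradient is exactly the gradient of `delta**2`. With `lam=0.5` the leftover trace changes it. Because both paths only produce a gradient, the same optimizer step can consume either one.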
Results
I ran experiments with one-step TD without eligibility traces, and with eligibility traces using λ = 0.0 (which should coincide with one-step TD), 0.1, 0.2, 0.4, 0.8, and so on.
I tried several seeds, but the results did not really show eligibility traces to be effective.
Seed | Result |
---|---|
46 | (plot) |
47 | (plot) |
48 | (plot) |
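Thinking about it more carefully, it seems odd that the λ = 0 run does not match one-step TD exactly over the long run. After a single update they should match exactly, but something makes them start to drift apart at some point, perhaps randomness or floating-point error.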
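One way floating-point error could enter is the order of multiplications. The sketch below is a toy with a two-weight scalar "network" q = w2 * w1 * x, not PyTorch's actual kernels. It compares scaling by -2δ before versus after forming dq/dw1 = w2 * x. The function names are made up, and whether this is the real cause of the drift is only a guess on my part.

```python
import random


def grad_loss_order(delta, w2, x):
    """Backprop order for delta**2: the factor -2*delta is applied first."""
    return ((-2.0 * delta) * w2) * x


def grad_trace_order(delta, w2, x):
    """Trace order: dq/dw1 = w2 * x is formed first, then scaled by -2*delta."""
    return (-2.0 * delta) * (w2 * x)


def count_mismatches(n_trials=10_000, seed=0):
    """Count random inputs for which the two orders differ in floating point."""
    rng = random.Random(seed)
    mismatches = 0
    for _ in range(n_trials):
        delta, w2, x = (rng.uniform(-1.0, 1.0) for _ in range(3))
        if grad_loss_order(delta, w2, x) != grad_trace_order(delta, w2, x):
            mismatches += 1
    return mismatches


n_diff = count_mismatches()
```

The two expressions are mathematically equal, but floating-point multiplication is not associative, so some inputs give results that differ in the last bit. If a mismatch like this appears in a later update, the two runs can follow different trajectories from then on, since the environment and the policy amplify small differences.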
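Analysis
To find out why adding eligibility traces does not improve sample efficiency, I recorded and plotted the TD error during one-step TD training.
Looking at this, the TD error itself is not that large, so I think the value function is being learned reasonably well. In other words, the cause of the poor sample efficiency may not lie in the value function.
On the other hand, once training has progressed the TD error is always negative, which I read as the Q-value at the next step being suboptimal, that is, a suboptimal action being chosen as the next action.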
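For keeping an eye on both the size and the sign of the TD error, a small running monitor like the following might be enough. It is a sketch in plain Python, not what the experiments above used: `TDErrorMonitor` and its methods are hypothetical names, and the window size is an arbitrary choice.

```python
from collections import deque


class TDErrorMonitor:
    """Keep the most recent TD errors and summarize their size and sign.

    Call add() at least once before asking for a summary.
    """

    def __init__(self, window: int = 1000) -> None:
        self.errors = deque(maxlen=window)  # old entries drop out automatically

    def add(self, delta: float) -> None:
        self.errors.append(float(delta))

    def mean(self) -> float:
        return sum(self.errors) / len(self.errors)

    def mean_abs(self) -> float:
        return sum(abs(e) for e in self.errors) / len(self.errors)

    def negative_fraction(self) -> float:
        return sum(e < 0.0 for e in self.errors) / len(self.errors)


monitor = TDErrorMonitor(window=3)
for d in [1.0, -1.0, -2.0, -3.0]:
    monitor.add(d)
```

Inside `update`, something like `monitor.add(delta.item())` should work, assuming `delta` is a single-element tensor. A `negative_fraction()` that stays near 1.0 would match the "always negative" pattern described above.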
Taken together, the difficulty seems to lie in learning the policy rather than the value function.
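In fact, plotting the log probability density of the action, which the code obtains as `lprob`, shows that it becomes very small.
I suspect the entropy regularization is acting very strongly. On the other hand, making the entropy regularization coefficient `alpha_lr` even slightly smaller seems to make `lprob` jump up, so that the policy keeps choosing the same action. Tuning this appears to be quite delicate.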
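To get a feel for how `lprob` relates to the spread of the policy, here is a rough sketch in plain Python. It assumes a diagonal Gaussian policy with no squashing, which may differ from the actual AVG actor, and 17 action dimensions for Humanoid-v5. Both helper names are made up for illustration.

```python
import math


def gaussian_lprob(action, mean, std):
    """Log density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def soft_target_value(q_next, next_lprob, alpha):
    """Entropy-regularized target, mirroring target_V = q2 - alpha_lr * next_lprob."""
    return q_next - alpha * next_lprob


dim = 17  # assumed action dimension
zeros = [0.0] * dim
lprob_wide = gaussian_lprob(zeros, zeros, [1.0] * dim)    # broad policy
lprob_narrow = gaussian_lprob(zeros, zeros, [0.05] * dim)  # nearly deterministic policy
bonus_wide = soft_target_value(0.0, lprob_wide, alpha=0.1)
bonus_narrow = soft_target_value(0.0, lprob_narrow, alpha=0.1)
```

Even at the mean action, the broad policy has a much lower log density than the narrow one, and the entropy term `- alpha * lprob` rewards exactly that. With many action dimensions the per-dimension terms add up, which may be part of why small changes to `alpha_lr` have such a large effect.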
Summary
I tried introducing eligibility traces into AVG, but they did not seem to help much. The current poor sample efficiency appears to come from the policy rather than the value function. Next time I will look into ways to improve the policy.