Hello! I'm Su, a client engineer at Akatsuki Games.
This article is the Day 24 entry of the Akatsuki Advent Calendar 2025. Merry Christmas!
Introduction
I did a little research on reinforcement learning as a student, and I wrote this article because I felt like trying it again after a long time. This time I'd like to share my experience using Gymnasium, a Python library!
What is reinforcement learning?
Put simply, reinforcement learning is a process in which an agent takes actions in an environment, learns from the outcomes, and optimizes how it selects actions so that it produces better results.

(From Gymnasium - Basic Usage)
For example, if we were building a self-driving agent for a racing game:
- Action: accelerate, brake, steer
- Observation: distance to the goal, direction and distance of walls, elapsed time, ...
- Reward: distance driven, number of wall collisions, ...
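The Action / Observation / Reward cycle can be sketched without any library. The toy environment below (`LineWorld`, a hypothetical name, not from the article's code) only imitates the shape of Gymnasium's `reset()` and `step()` return values; the reward numbers are arbitrary assumptions.

```python
import random


class LineWorld:
    """Hypothetical toy environment: walk right along a line to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self, seed=None):
        # Start every episode at the left end; return (observation, info).
        self.position = 0
        return self.position, {}

    def step(self, action):
        # Action: 0 = move left, 1 = move right (clamped to the line).
        delta = 1 if action == 1 else -1
        self.position = min(max(self.position + delta, 0), self.length - 1)
        terminated = self.position == self.length - 1
        # Reward: +1 on reaching the goal, a small cost for every other step.
        reward = 1.0 if terminated else -0.1
        return self.position, reward, terminated, False, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the total reward collected."""
    observation, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)  # the agent picks an Action
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward        # the environment scores it
        if terminated or truncated:
            break
    return total_reward


rng = random.Random(0)
random_return = run_episode(LineWorld(), lambda obs: rng.choice([0, 1]))
greedy_return = run_episode(LineWorld(), lambda obs: 1)
print(random_return, greedy_return)
```

A policy that always moves right collects the best possible return here, so the random policy can only match or fall below it. Learning means finding such a policy from rewards alone.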
The algorithm an agent uses to decide its actions is a major topic in reinforcement learning. From Q-Learning, PPO, and SAC to the more recent DAPO, algorithms are evolving very quickly. I won't dig into them in this article, but please look them up if you're interested!
What is Gymnasium?
The team that had maintained the OpenAI Gym library since 2021 moved its development to Gymnasium.
It is a Python library dedicated to developing reinforcement learning environments and training on them.

(From https://gymnasium.farama.org/)
This time I modified the example map from the official documentation so that it holds a player, a goal, and traps. I'll have an AI learn the task of moving the player to the goal while stepping on as few traps as possible, and then evaluate what it learned.

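Creating the environment
First, a Gymnasium environment class, gymnasium.Env, consists of three parts: initialization, update, and rendering. In addition, the following are defined in the constructor:
- Minimum and maximum values of the Observation (observation space)
- Available Actions (action space)
- Rendering and debugging parameters (metadata)
- Random number generator (np_random) ← seeding it is handy when you want to reproduce a specific result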
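Initialization
This logic goes in reset(). It puts the environment into its initial state and returns the first Observation. It runs once per episode (from the start of the task to its end). Here I wrote code that randomly generates the positions of the player, the goal, and the traps. The Observation consists of the positions of the player, the goal, and the traps.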
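Update
The chosen Action is passed to step(), which returns how the environment changed and how good the action was. Here the Action is a movement direction; inside the function I update the player's position, award points on reaching the goal, and deduct points for stepping on a trap.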
Rendering
This is defined in render(). Here I draw the map and its cells with the PyGame library, with a blue circle for the player, a red circle for the goal, and black circles for the traps.
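As a dependency-free illustration of what a render step has to do, here is a text-mode stand-in. It is not the PyGame drawing used in the article: it prints letters instead of colored circles, and the function name `render_text` and the symbols are made up for this sketch.

```python
def render_text(size, agent, target, traps):
    """Return the grid as text: P = player, G = goal, X = trap, . = empty."""
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            if (x, y) == agent:
                row.append("P")
            elif (x, y) == target:
                row.append("G")
            elif (x, y) in traps:
                row.append("X")
            else:
                row.append(".")
        rows.append(" ".join(row))
    return "\n".join(rows)


print(render_text(3, agent=(0, 0), target=(2, 2), traps={(1, 1)}))
```

A PyGame version would follow the same loop over cells, replacing each letter with a circle drawn at the cell's pixel position.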
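Packaging
I want to use the environment in various places, so I packaged it. Once it is packaged, you can use handy Wrappers. They are often used to convert the Observation into another form (for example, the relative position between the player and the goal instead of their absolute positions).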
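Agent
The agent is the rule that looks at an Observation and decides which Action to take. This is where algorithms such as Q-Learning, PPO, and SAC come in. You can implement one yourself, but Stable-Baselines3, the officially recommended library, is convenient. With a packaged Gym environment, creating a model takes only a few lines. You can also train on multiple environments in parallel.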
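Training
The algorithm uses the score returned by the environment to evaluate the Action it just chose, and adjusts itself on whether to choose the same Action the next time the same Observation comes up. The meaning of each parameter is described in the Stable-Baselines3 documentation, so I'll skip it here.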
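To make "score the Action, then adjust" concrete, here is a hand-rolled tabular Q-learning sketch on a toy line world. It is not what Stable-Baselines3 runs internally (its algorithms use neural networks), and the environment, learning rate and other constants are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def q_learning(n_states=6, episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn to walk right along a line of n_states cells; the last cell is the goal."""
    rng = random.Random(seed)
    # q[state] = [estimated value of moving left, estimated value of moving right]
    q = defaultdict(lambda: [0.0, 0.0])
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so an episode always ends
            # Epsilon-greedy: mostly exploit the table, sometimes explore.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            move = 1 if action == 1 else -1
            next_state = min(max(state + move, 0), n_states - 1)
            done = next_state == n_states - 1
            reward = 1.0 if done else -0.1
            # Move the estimate toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = q_learning()
greedy_actions = [0 if q_table[s][0] > q_table[s][1] else 1 for s in range(5)]
print(greedy_actions)
```

The single update line is the whole learning rule: an Action that led to a better-than-expected outcome has its value raised, so it is more likely to be picked the next time the same state appears.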
Logging
Stable-Baselines3 can write logs for TensorBoard, which plots the training progress as graphs. This was a great help when checking whether training was sufficient and when tuning parameters.

Evaluating the training result
A model trained with Stable-Baselines3 can be saved. Load the saved model and it outputs an Action for any Observation you give it.
All right! I've built the ultimate AI, so let's try the model right away!
Huh? It's not moving......
It doesn't move
The first problem was that the agent stopped moving once it ran into a wall. To fix this, I added a Reward that deducts points if the position is the same as in the previous step...

Going back and forth between neighboring cells forever
The agent found a loophole, "anything is fine as long as it isn't the same place", and now keeps going back and forth between the same cells...

Come on, AI, move properly! I revised the Reward so that moving to a place it has already passed through costs points.
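The two Reward fixes and the loophole between them can be reproduced in a few lines. This is an illustrative sketch only: the function names and every numeric value are assumptions, not the values used in the actual project.

```python
def shaped_reward(position, previous, visited, goal, traps, revisit_penalty=True):
    """Score one move. All numeric values are illustrative assumptions."""
    if position == goal:
        return 10.0
    reward = 0.0
    if position in traps:
        reward -= 1.0  # stepped on a trap
    if position == previous:
        reward -= 0.5  # bumped into a wall and did not move
    elif revisit_penalty and position in visited:
        reward -= 0.5  # returned to a cell already passed through
    return reward


def score_path(path, goal, traps, revisit_penalty):
    """Sum the shaped reward over a path given as a list of (x, y) cells."""
    total, visited = 0.0, {path[0]}
    for previous, position in zip(path, path[1:]):
        total += shaped_reward(position, previous, visited, goal, traps, revisit_penalty)
        visited.add(position)
    return total


# An agent that just oscillates between two neighboring cells.
back_and_forth = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0)]
print(score_path(back_and_forth, goal=(4, 4), traps=set(), revisit_penalty=False))
print(score_path(back_and_forth, goal=(4, 4), traps=set(), revisit_penalty=True))
```

With only the "same position" penalty, the oscillating path costs nothing, which is exactly the loophole the agent exploited. Tracking visited cells makes every return trip cost points.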
After a lot of tweaking, it started to move in a reasonably convincing way! It still steps on a trap now and then, but it looks like it is trying hard to avoid them.

A larger map
I retrained it on a 15x15 map! Training took twice as long as on the 10x10 map, but I think the result is not bad.

GitHub Repo
I've uploaded the code used here to GitHub, so please give it a try if you're interested!
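Closing thoughts
It's fun to tune things and watch the AI steadily improve, but I came to appreciate how hard it is to set an appropriate Reward. Also, this is just my personal impression, but reinforcement learning seems to do better on tasks where the Action space is continuous. If so, the design of the Action matters too. Now I want to make it take on more complex tasks!
I think the concepts and algorithms of reinforcement learning can be put to use in game development too, not just in AI. I learned a lot!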
Tomorrow is the 25th! On Christmas Day, a colleague is scheduled to publish a wonderful article. Please be sure to read it!