Trying Reinforcement Learning on Kaggle
Overview
Kaggle is currently running a Getting Started competition for reinforcement learning called Connect X. I studied a bit of reinforcement learning through this competition, so I'd like to write up what I learned.
This article covers my understanding of reinforcement learning based on this book, together with an implementation for the Connect X competition. If you spot any mistakes, I'd be happy to hear about them in the comments.
What Is Reinforcement Learning?
Reinforcement learning means building a model that, in an environment where actions yield rewards, outputs the action that leads to reward in each situation.
The difference from supervised learning is that the goal is to maximize the reward obtained through a sequence of actions. Think of Go: a move that looks bad in a given position may actually turn out to be a good one several moves later, and reinforcement learning is what lets us learn to choose such a move.
Connect X and Reinforcement Learning
Connect X is the classic Connect Four game. You win by lining up four of your pieces vertically, horizontally, or diagonally before your opponent does.
Instead of the usual csv file, you submit a Python file that describes your agent's behavior.
With the rules of Connect X in mind, let's organize the reinforcement learning concepts.
Agent
The player playing Connect Four.
Action
Dropping a piece into a column.
In Connect X the pieces are called "checkers", and choosing a column is described as "dropping" a checker.
State
The arrangement of checkers on the game board.
(In what follows, $s$ denotes the current state and $s'$ the state at the next STEP.)
Reward
At the end of the game you receive a reward of 1 for a win, 0 for a loss, and 0.5 otherwise (a draw, or the game not yet decided).
The reward received immediately after an action is called the immediate reward.
The sum of time-discounted rewards is written as

$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$

where $t$ is the time (move/step) and $\gamma$ is the time discount rate.
The discount is there because winning in 10 moves should be valued more highly than winning in 20 moves.
This can also be written recursively:

$G_t = r_{t+1} + \gamma G_{t+1}$
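As a quick sanity check of the formula, here is a minimal sketch that computes the discounted return both directly and with the recursive form; the reward list and the value of gamma are made-up example values, not something from the competition.

def discounted_return(rewards, gamma):
    # G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    # recursive form: G_t = r_{t+1} + gamma * G_{t+1}
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

rewards = [0.5, 0.5, 0.5, 1.0]  # hypothetical per-step rewards ending in a win
print(discounted_return(rewards, gamma=0.9))            # both print the same value (~2.084)
print(discounted_return_recursive(rewards, gamma=0.9))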
Reward Function
A function that returns the reward.
Transition Function
A function that, given the current state and an action, returns the probability of ending up in each state and the transition destination.
The transition function outputs the state transition probabilities, and the transition destination is the state with the highest transition probability.
In Connect X, any action that is selectable in the game always leads to the expected state, so we do not need to consider this.
Policy
A function that decides the next action in a given state.
It looks similar to the transition function, but the policy decides which action to actually take, while the transition function defines which state results when that action is taken.
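To make the difference concrete, here is a toy sketch of my own (not the competition API) that writes the two as Python functions: the policy returns an action, the transition function returns probabilities over next states.

import random

def toy_policy(state, legal_columns):
    # policy: given a state, decide which action to actually take (here, a random legal column)
    return random.choice(legal_columns)

def toy_transition(state, action):
    # transition function: given a state and an action, return the possible next states
    # with their probabilities. In Connect X this is deterministic, so the chosen action
    # leads to exactly one next state with probability 1.
    next_state = state + [action]      # stand-in for "drop a checker in column `action`"
    return {tuple(next_state): 1.0}    # {next_state: probability}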
Types of Reinforcement Learning
Model-based
Learning based on the transition function and the reward function is called model-based.
The value obtained by acting according to a policy $\pi$ from a state $s$ can be written as

$V_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s' \mid s, a) \bigl( R(s, s') + \gamma V_\pi(s') \bigr)$

The expected value can be derived by multiplying the action probabilities (determined by the policy) with the transition probabilities.
The approach that always chooses the action that maximizes this value is called value-based, and it learns only how actions are evaluated. The approach that instead decides actions with a policy and uses the action values to evaluate and update that policy is called policy-based.
In the equation above, the value at the next STEP, $V_\pi(s')$, has to be already computed, and since computing the value for every action is not easy when there are many patterns, dynamic programming (DP) is used.
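As a minimal sketch of what "computing the values with DP" means, here is iterative policy evaluation of the equation above on a made-up two-state environment; the states, actions, transitions, and rewards are invented purely for illustration and have nothing to do with Connect X.

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 0.5)], "go": [(1.0, "A", 0.0)]},
}
policy = {"A": {"go": 1.0}, "B": {"stay": 1.0}}  # a fixed policy to evaluate
gamma = 0.9

V = {s: 0.0 for s in transitions}  # value estimates, initialized to 0
for _ in range(100):               # sweep until (approximately) converged
    for s in transitions:
        V[s] = sum(
            p_a * p * (r + gamma * V[s2])
            for a, p_a in policy[s].items()
            for p, s2, r in transitions[s][a]
        )
print(V)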
With model-based methods, the agent can derive an optimal plan (policy) from information about the environment alone, without taking a single step. However, this requires the transition function and the reward function to be known (or at least estimable). For that reason, model-free methods are generally used rather than model-based ones. The approach for Connect X here is also model-free, so I will skip the details of model-based methods.
Model-free
Learning in which the agent acts on its own and learns from that experience is called model-free.
Experience here means the difference between the value we had estimated and the value observed when actually taking the action.
代表çãªãã®ã«ãã¢ã³ãã«ã«ãæ³ã¨TDæ³ãããã¾ããTDæ³ã¯1STEPé²ãã ãã誤差 (TD誤差) ãå°ããããæ´æ°ãè¡ããã¢ã³ãã«ã«ãæ³ã¯ã¨ãã½ã¼ãçµäºã¾ã§STEPãé²ãã¦ããã誤差ãå°ããããæ´æ°ãè¡ãã¾ãã
TDæ³ã® ã®æ´æ°ã®ä»æ¹
ã¢ã³ãã«ã«ãæ³ã® ã®æ´æ°ã®ä»æ¹
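A minimal sketch of the two update rules above, using a plain dict of state values; the variable names and the values of alpha and gamma are my own choices for the example.

alpha, gamma = 0.1, 0.9
V = {}  # state -> estimated value

def td_update(s, r, s_next):
    # one-step TD(0): move V(s) toward r + gamma * V(s')
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error

def monte_carlo_update(episode):
    # episode: list of (state, reward) pairs; move each V(s) toward the actual return G_t
    G = 0.0
    for s, r in reversed(episode):
        G = r + gamma * G
        V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))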
A representative TD method is Q-learning. The value of taking a certain action in a certain state is written $Q(s, a)$ and is called the Q-value. Q-learning takes the action that transitions to the state of maximum value without using the policy, and updates the value estimates accordingly, so it is called off-policy (no policy). In contrast, the SARSA method decides its actions based on the policy, and it is the policy that gets updated, so it is called on-policy. There is also the Actor-Critic method, in which an Actor is in charge of the policy and a Critic is in charge of the value estimates, and the two are updated in turn.
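For comparison, here is a minimal sketch of the Q-learning and SARSA updates side by side, written in a generic textbook form with a plain dict; this is separate from the Connect X implementation later in this article.

from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)  # (state, action) -> Q-value

def q_learning_update(s, a, r, s_next, actions):
    # off-policy: bootstrap from the best action in s', regardless of what the policy would do
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # on-policy: bootstrap from the action a' actually chosen by the current policy
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])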
Connect X
Now that we have a rough picture of reinforcement learning, let's play with the Connect X environment.
Installation
To use the environment of the ConnectX competition, install the following library.
>> pip install kaggle-environments
How to use the library
make creates an instance of the game environment, and render displays the state of the game board.
from kaggle_environments import make, utils

env = make("connectx", debug=True)
env.render()
configuration holds the game's configuration. You can see that the board has 7 columns and 6 rows, and that you need to line up 4 checkers.
print(env.configuration)
>> {'timeout': 5, 'columns': 7, 'rows': 6, 'inarow': 4, 'steps': 1000}
When an episode ends, done returns True.
Let's set the opponent to "random", create a trainer, initialize (reset) the game, and drop into column 0 on every turn.
trainer = env.train([None, "random"])
state = trainer.reset()
print(f"board: {state.board}\n"\
      f"mark: {state.mark}")

while not env.done:
    state, reward, done, info = trainer.step(0)
    print(f"reward: {reward}, done: {done}, info: {info}")

board = state.board
env.render(mode="ipython", width=350, height=300)
>> board: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>> mark: 1
>> reward: 0.5, done: False, info: {}
>> reward: 0.5, done: False, info: {}
>> reward: 0.5, done: False, info: {}
>> reward: 1, done: True, info: {}
- state.board gives the board layout as a flattened (serialized) array
- state.mark tells you whether your own checkers are 1 or 2
- passing the column you drop into to trainer.step() returns the state after the opponent's drop, the reward, and the episode-done flag
- dropping into a column that already holds 6 checkers is an Invalid Action: the game ends with a reward of Nan
- setting render's mode to ipython lets you play back the game as a video inside a jupyter notebook
Evaluation Metric
Each submission's skill is modeled as a Gaussian distribution, and its μ value is what appears on the LB as the skill rating. When you submit, μ is initialized to 600 and the agent joins the pool of all agents. Each agent plays up to 8 episodes a day against agents whose ratings are close to its own. Losing one of these matches lowers μ, winning raises it, and a draw moves both toward their average. The size of each update takes the deviation of the two agents into account. For new agents, the update rate is apparently raised a little so that they settle to an appropriate value as quickly as possible.
When you build a new agent it is hard to compute, before submitting, where it would sit on the current LB, but a stronger agent will gradually climb the LB, and one that keeps losing will drift down.
Creating an Agent
In the Connect X competition you have to submit a Python file describing the agent's behavior, so let's create an agent and submit it.
We'll create an agent that simply picks, at random, one of the columns whose top cell is 0 (empty).
from random import choice

def my_agent(state, configuration):
    return choice([c for c in range(configuration.columns) if state.board[c] == 0])
Passing the game name, the agents, and the number of episodes to evaluate gives you the match results.
The output below means 2 wins and 1 loss.
from kaggle_environments import evaluate

print(evaluate("connectx", [my_agent, "random"], num_episodes=3))
>> [[1, 0], [0, 1], [1, 0]]
Write my_agent out to a submission.py file.
import inspect
import os

def write_agent_to_file(function, file):
    with open(file, "a" if os.path.exists(file) else "w") as f:
        f.write(inspect.getsource(function))

write_agent_to_file(my_agent, "submission.py")
The following code checks that the agent in the submission file runs correctly; run it before submitting.
import sys

out = sys.stdout
submission = utils.read_file("{submission file path}")
agent = utils.get_last_callable(submission)
sys.stdout = out

env = make("connectx", debug=True)
env.run([agent, agent])
print("Success" if env.state[0].status == env.state[1].status == "DONE" else "Failed")
>> Success
Once the file is written, upload it as you usually would.
As with other competitions, you can submit from a kernel or via the API.
If you click the display icon on the LB, you can watch videos of the matches played on the LB! Touches like this, different from other competitions, are fun.
Implementing Q-Learning
The value of taking a certain action in a certain state is called the Q-value, and Q-learning is the method that learns these Q-values. Let's implement Q-learning for Connect X.
Q-table
The implementation of the Q-table that stores the Q-values:
- Q : the Q-table; a dict whose keys are states and whose values are arrays of Q-values for all actions
- get_state_key : expresses the state (including which checker is ours), which is the key of the Q-table, as a hexadecimal state_key
- get_q_values : a function that returns the Q-values of all actions in a given state as an array (0-6, in drop-column order)
- update : applies an update for a given state and the drop action taken
import numpy as np

class QTable():
    def __init__(self, actions):
        self.Q = {}  # the Q-table
        self.actions = actions

    def get_state_key(self, state):
        # build the key of the state as a hexadecimal string
        board = state.board[:]
        board.append(state.mark)
        state_key = np.array(board).astype(str)
        return hex(int(''.join(state_key), 3))[2:]

    def get_q_values(self, state):
        # return the array of Q-values for all actions in this state
        state_key = self.get_state_key(state)
        if state_key not in self.Q.keys():  # first time we see this state
            self.Q[state_key] = [0] * len(self.actions)
        return self.Q[state_key]

    def update(self, state, action, add_q):
        # update the Q-value of the chosen action
        state_key = self.get_state_key(state)
        self.Q[state_key] = [q + add_q if idx == action else q
                             for idx, q in enumerate(self.Q[state_key])]
Implementing the Agent
- policy function : based on the Q-table, selects the action with the largest Q-value in the given state
- custom_reward : customizes the reward function so that building the Q-table goes well
- learn : updates the Q-table episode by episode to learn
- q_table : the Q-table that stores a value for each (state, action) pair
- reward_log : the history of rewards

Parameters
- episode_cnt : number of episodes used for training
- epsilon : probability of exploring (not following the Q-values); implemented so that it starts large and gradually shrinks
- gamma : time discount rate
- learn_rate : learning rate
from random import choice
import numpy as np
from tqdm import tqdm

env = make("connectx", debug=True)
trainer = env.train([None, "random"])

class QLearningAgent():
    def __init__(self, env, epsilon=0.99):
        self.env = env
        self.actions = list(range(self.env.configuration.columns))
        self.q_table = QTable(self.actions)
        self.epsilon = epsilon
        self.reward_log = []

    def policy(self, state):
        if np.random.random() < self.epsilon:
            # with probability epsilon, pick an action at random
            return choice([c for c in range(len(self.actions)) if state.board[c] == 0])
        else:
            # pick the legal action with the largest Q-value
            q_values = self.q_table.get_q_values(state)
            selected_items = [q if state.board[idx] == 0 else -1e7
                              for idx, q in enumerate(q_values)]
            return int(np.argmax(selected_items))

    def custom_reward(self, reward, done):
        if done:
            if reward == 1:  # win
                return 20
            elif reward == 0:  # loss
                return -20
            else:  # draw
                return 10
        else:
            return -0.05  # game not decided yet

    def learn(self, trainer, episode_cnt=10000, gamma=0.6, learn_rate=0.3,
              epsilon_decay_rate=0.9999, min_epsilon=0.1):
        for episode in tqdm(range(episode_cnt)):
            # reset the game environment
            state = trainer.reset()
            # gradually shrink epsilon
            self.epsilon = max(min_epsilon, self.epsilon * epsilon_decay_rate)
            while not env.done:
                # decide which column to drop into and execute it
                action = self.policy(state)
                next_state, reward, done, info = trainer.step(action)
                reward = self.custom_reward(reward, done)
                # compute the error and update the Q-table
                gain = reward + gamma * max(self.q_table.get_q_values(next_state))
                estimate = self.q_table.get_q_values(state)[action]
                self.q_table.update(state, action, learn_rate * (gain - estimate))
                state = next_state
            self.reward_log.append(reward)
Results
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# train
qa = QLearningAgent(env)
qa.learn(trainer)

# moving average of the reward obtained at the end of each game
sns.set(style='darkgrid')
pd.DataFrame({'Average Reward': qa.reward_log}).rolling(500).mean().plot(figsize=(10,5))
plt.show()
The updated q_table now holds the Q-values obtained through training, and reward_log holds the history of rewards (wins and losses).
Looking at the moving average of the reward, you can see the win rate gradually rising. It seems to be learning properly!
Writing out the Python File
First, to write the agent's behavior out to a Python file as a single function, we convert the Q-table data into a string and, with the code below, write it out so that it can be handled as a dict inside the Python file.
tmp_dict_q_table = qa.q_table.Q.copy()
dict_q_table = dict()

# in the learned Q-table, replace each entry with the action that has the largest Q-value
for k in tmp_dict_q_table:
    if np.count_nonzero(tmp_dict_q_table[k]) > 0:
        dict_q_table[k] = int(np.argmax(tmp_dict_q_table[k]))

my_agent = '''def my_agent(observation, configuration):
    from random import choice
    # convert the table to a string so it can be used as a dict inside the Python file
    q_table = ''' \
    + str(dict_q_table).replace(' ', '') \
    + '''
    board = observation.board[:]
    board.append(observation.mark)
    state_key = list(map(str, board))
    state_key = hex(int(''.join(state_key), 3))[2:]

    # if the state is not in the Q-table
    if state_key not in q_table.keys():
        return choice([c for c in range(configuration.columns) if observation.board[c] == 0])

    # take the action with the largest Q-value from the Q-table
    action = q_table[state_key]

    # if the chosen action is not legal in the game
    if observation.board[action] != 0:
        return choice([c for c in range(configuration.columns) if observation.board[c] == 0])

    return action
    '''

with open('submission.py', 'w') as f:
    f.write(my_agent)
For how to build the Q-table and write out the file, I referred to this kernel:
ConnectX with Q-Learning | Kaggle
Implementing Deep Q-Net
Let's also implement Deep Q-Net, the best-known approach that applies deep learning to reinforcement learning, for Connect X.
The basic idea is the same as Q-learning, but a CNN is used for the value estimation that the Q-table used to do.
The input is the state, the output is the value of each action, and the implementation minimizes the TD error through the loss function.
There are also three techniques for making the training go well.
Experience Replay
The agent's action history is stored, and samples drawn from it are used for training. One entry of the action history is a tuple of [state, action, reward, next state, episode-done flag]. Using data from different timings of various episodes helps stabilize training.
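As a minimal standalone sketch of the idea (the agent later in this article keeps its history in a dict of lists instead), a replay buffer can look like this; the class and names are my own illustration.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, max_size=10_000):
        self.buffer = deque(maxlen=max_size)  # old entries drop out automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # draw a random mini-batch of past transitions for one training step
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))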
Fixed Target Q-Network
If the value of the next state is computed with the same CNN that is currently being updated, the value changes every time the weights are updated, and the TD error becomes unstable. Instead, the value of the next state is computed from a CNN that has not been updated for a while, and that network is refreshed only at certain intervals. Training therefore uses two CNNs: the one that keeps being updated for value estimation, and the one used to compute the value of the next state.
Clipping
Rewards are unified to 1 for a success, -1 for a failure, and 0 otherwise.
Implementing the CNN
We implement the CNN that performs the value estimation. Because of the Fixed Target Q-Network above, both the value-estimation CNN and the next-state-value CNN use this network.
Since Connect Four has a small game board, I made the network a small CNN with two convolutional layers. The input is the arrangement of checkers on the game board, fed in as-is as a 2-D (7, 6) array; the output is the value of each action (7).
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, outputs=7):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, 3)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, 3)
        self.bn2 = nn.BatchNorm2d(32)
        self.fc = nn.Linear(192, 32)
        self.head = nn.Linear(32, outputs)

    def forward(self, x):
        # x: (batch, 1, 7, 6) board -> (batch, 7) action values
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = x.view(x.size()[0], -1)
        x = self.fc(x)
        x = self.head(x)
        return x
Implementing the Deep Q-Net Agent
Now let's implement the agent. The differences from the Q-learning implementation are the following 4 points:
- the part that minimizes the error (TD error) between the estimated value and the value actually obtained is now a CNN
- the checker arrangement is converted to a (1, 7, 6) Tensor so it can be fed to the CNN
- our own checkers are mapped to 1 and the opponent's checkers to 0.5
- the techniques above, Experience Replay, Fixed Target Q-Network, and Clipping, are used
class DeepQNetworkAgent():
    def __init__(self, env, lr=1e-2, min_experiences=100, max_experiences=10_000, channel=1):
        self.env = env
        self.model = CNN()          # CNN for value estimation
        self.teacher_model = CNN()  # CNN for estimating the value of the next state
        self.optimizer = optim.Adam(self.model.parameters(), lr=lr)
        self.criterion = nn.MSELoss()
        self.experience = {'s': [], 'a': [], 'r': [], 'n_s': [], 'done': []}  # action history
        self.min_experiences = min_experiences
        self.max_experiences = max_experiences
        self.actions = list(range(self.env.configuration.columns))
        self.col_num = self.env.configuration.columns
        self.row_num = self.env.configuration.rows
        self.channel = channel

    def add_experience(self, exp):
        # update the action history
        if len(self.experience['s']) >= self.max_experiences:
            # drop the oldest entries when the history gets too large
            for key in self.experience.keys():
                self.experience[key].pop(0)
        for key, value in exp.items():
            self.experience[key].append(value)

    def preprocess(self, state):
        # represent the state as a 7x6 array with our checkers as 1 and the opponent's as 0.5
        result = np.array(state.board[:])
        result = result.reshape([self.col_num, self.row_num])
        if state.mark == 1:
            return np.where(result == 2, 0.5, result)
        else:
            # map the opponent's checkers (1) to 0.5 first, then our own (2) to 1
            result = np.where(result == 1, 0.5, result)
            return np.where(result == 2, 1, result)

    def estimate(self, state):
        # compute the values
        return self.model(
            torch.from_numpy(state).view(-1, self.channel, self.col_num, self.row_num).float()
        )

    def future(self, state):
        # compute the values of the next states
        return self.teacher_model(
            torch.from_numpy(state).view(-1, self.channel, self.col_num, self.row_num).float()
        )

    def policy(self, state, epsilon):
        # choose the next action from the state, based on the CNN output
        if np.random.random() < epsilon:
            # explore
            return int(np.random.choice([c for c in range(len(self.actions)) if state.board[c] == 0]))
        else:
            # get the value of each action
            prediction = self.estimate(self.preprocess(state))[0].detach().numpy()
            for i in range(len(self.actions)):
                # restrict to actions that are legal in the game
                if state.board[i] != 0:
                    prediction[i] = -1e7
            return int(np.argmax(prediction))

    def update(self, gamma):
        # make sure enough experience has been accumulated
        if len(self.experience['s']) < self.min_experiences:
            return
        # sample the ids of the training data from the action history
        ids = np.random.randint(low=0, high=len(self.experience['s']), size=32)
        states = np.asarray([self.preprocess(self.experience['s'][i]) for i in ids])
        states_next = np.asarray([self.preprocess(self.experience['n_s'][i]) for i in ids])

        # compute the values
        estimateds = self.estimate(states)                  # estimated values (keeps gradients)
        future = self.future(states_next).detach().numpy()  # values of the next states

        # build the TD targets: only the taken action gets the new value
        target = estimateds.detach().numpy().copy()
        for idx, i in enumerate(ids):
            a = self.experience['a'][i]
            r = self.experience['r'][i]
            d = self.experience['done'][i]
            reward = r
            if not d:
                reward += gamma * np.max(future[idx])
            target[idx][a] = reward

        # update the CNN so that the TD error shrinks
        self.optimizer.zero_grad()
        loss = self.criterion(estimateds, torch.from_numpy(target).float())
        loss.backward()
        self.optimizer.step()

    def update_teacher(self):
        # refresh the network used for next-state values
        self.teacher_model.load_state_dict(self.model.state_dict())
Implementing the Deep Q-Net Trainer
Basically it is no different from Q-learning.
What is added is the processing that accumulates the action history, and the processing that copies the parameters of the value-estimation CNN to the next-state-value CNN at fixed intervals.
class DeepQNetworkTrainer():
    def __init__(self, env):
        self.epsilon = 0.9
        self.env = env
        self.agent = DeepQNetworkAgent(env)
        self.reward_log = []

    def custom_reward(self, reward, done):
        # Clipping
        if done:
            if reward == 1:  # win
                return 1
            elif reward == 0:  # loss
                return -1
            else:  # draw
                return 0
        else:
            return 0  # game not decided yet

    def train(self, trainer, epsilon_decay_rate=0.9999, min_epsilon=0.1,
              episode_cnt=100, gamma=0.6):
        iter = 0
        for episode in tqdm(range(episode_cnt)):
            rewards = []
            state = trainer.reset()  # reset the game environment
            self.epsilon = max(min_epsilon, self.epsilon * epsilon_decay_rate)  # gradually shrink epsilon
            while not env.done:
                # decide which column to drop into
                action = self.agent.policy(state, self.epsilon)
                prev_state = state
                state, reward, done, _ = trainer.step(action)
                reward = self.custom_reward(reward, done)

                # accumulate the action history
                exp = {'s': prev_state, 'a': action, 'r': reward, 'n_s': state, 'done': done}
                self.agent.add_experience(exp)

                # update the value estimation
                self.agent.update(gamma)
                iter += 1
                if iter % 100 == 0:
                    # refresh the network used for next-state values
                    self.agent.update_teacher()
            self.reward_log.append(reward)
Results
Let's actually train with the Deep Q-Net agent.
dq = DeepQNetworkTrainer(env)
dq.train(trainer, episode_cnt=30000)

# plot the results
import seaborn as sns
sns.set()
sns.set_palette("winter", 8)
sns.set_context({"lines.linewidth": 1})
pd.DataFrame({'Average Reward': dq.reward_log}).rolling(300).mean().plot(figsize=(10,5))
Looking at the moving average of wins and losses from the reward history, the agent gradually starts to win, so it seems to be learning well. (The reward function differs from the earlier Q-learning, so the y-axis scale is different.)
I trained for 20,000 episodes this time, but looking at other people's kernels, some manage to train well in around 3,000 episodes, so it may be better to tune the CNN and the parameters so that it learns well more quickly.
Closing Thoughts
As a place for reinforcement learning beginners to study, I think Kaggle's Connect X is ideal! It is very convenient that spinning up a Kaggle notebook immediately gives you an environment where you can run an agent. There is the tricky problem of how to write out a trained agent (you cannot read external files or load a trained model), but since it is a Getting Started competition it was easy to join and a lot of fun.
This post focused mainly on the Connect X implementation, and I still have a lot to learn about reinforcement learning theory, so I plan to keep studying.
Study Group Announcement
At Wantedly we hold a machine learning study group every Thursday from 18:30. Since all employees are currently working remotely, it is held online (hangouts)! Being online makes it all the easier to join, so if you are interested, please come along!
We are also recruiting for casual interviews (currently online) and internships!
www.wantedly.com
Recommended Books
As mentioned in the overview, this time I studied with Pythonで学ぶ強化学習 from the Machine Learning Startup Series. It includes Python code and is easy to follow, so I think it is a perfect fit for anyone about to start learning reinforcement learning! It also covers many things not touched on in this article, so if you are curious I recommend reading it.
Also, hakubishin-san recommended the following books to me! I hope they are useful to anyone studying reinforcement learning.
www.kinokuniya.co.jp
honto.jp