Hello, this is Hokekiyo!
Have you heard of reinforcement learning?
It's the machine learning technology that has been getting a lot of attention lately, with headlines like "AlphaGo beat the world champion of Go."
I've written several related articles on this blog before, but this time I tried out "ChainerRL", a tool that makes reinforcement learning easy to use with Chainer!
It turned out to be quite handy, so I've put together a walkthrough of how to use it, with a few tips added along the way. (The code follows a Jupyter notebook, so copy-pasting it from top to bottom should basically just work.)
If you've been wanting to give reinforcement learning a try, this might be a good way to see what it's like!
- What is reinforcement learning?
- chainerrl
- Setup
- Import the required libraries
- Setting up the environment
- Setting up the Agent
- Setting the optimizer and parameters
- Run (training)
- Testing
- Doing the train → test setup much more simply
- Bonus 1: Try using a GPU
- Bonus 2: Visualizing from obs
- Closing remarks
- Related
What is reinforcement learning?
This article is the clearest explanation I know of and a good place to start.
chainerrl
- A reinforcement learning module for Chainer
- It lets you use recent reinforcement learning algorithms (DQN and friends) while reusing your existing Chainer networks.
Below I follow the official quickstart, adding various things I looked up along the way, and actually run it.
Setup
pip install chainerrl
Or install from source. The source code is here:
git clone https://github.com/pfnet/chainerrl.git
cd chainerrl
python setup.py install
Import the required libraries
- chainer : a deep learning library
- chainerrl : the reinforcement learning library for Chainer
- gym : environments for reinforcement learning experiments
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np
Setting up the environment
- To use ChainerRL, the environment model needs to be provided as an "environment".
- Anything registered in OpenAI Gym can be used as-is with gym.make(hogehoge); the list of possible values for hogehoge is on the OpenAI/envs page (you can also list them from code, as in the sketch below).
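As a side note, you can enumerate the registered environment IDs from code rather than the web page. A minimal sketch, assuming gym's envs.registry API (the exact set of IDs printed depends on your installed gym version):

import gym
from gym import envs

# Each entry of registry.all() is an EnvSpec; its .id can be passed to gym.make().
for spec in sorted(envs.registry.all(), key=lambda s: s.id):
    print(spec.id)

# Any printed ID can then be instantiated, e.g.
env = gym.make("CartPole-v0")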
Minimum requirements for an environment (a minimal sketch follows this list):
- observation space : the state (input) at a given time step
- action space : the action to choose at a given time step t
- two methods, reset and step
- env.reset : initialization
- env.step : execution; carries out an action and moves to the next state, returning four values (next observation, reward, whether the episode is over, extra info)
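To make these requirements concrete, here is a minimal sketch of a gym-style environment. The class ToyCoinEnv and its dynamics are made up purely for illustration (it is not part of gym or ChainerRL); it just shows the observation_space / action_space attributes and the reset / step interface described above:

import numpy as np
import gym
from gym import spaces

class ToyCoinEnv(gym.Env):
    """Made-up toy task: guess whether a hidden coin is 0 or 1 from a noisy hint."""

    def __init__(self):
        # observation space: a single float in [0, 1] hinting at the hidden coin
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,))
        # action space: guess 0 or 1
        self.action_space = spaces.Discrete(2)
        self._coin = 0

    def reset(self):
        # initialize an episode and return the first observation
        self._coin = np.random.randint(2)
        hint = np.clip(self._coin + np.random.uniform(-0.2, 0.2), 0.0, 1.0)
        return np.array([hint], dtype=np.float32)

    def step(self, action):
        # execute one action: reward 1 for a correct guess, then end the episode
        reward = 1.0 if action == self._coin else 0.0
        done = True
        info = {}
        next_obs = self.reset()  # dummy next observation for this one-step task
        return next_obs, reward, done, info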
What we'll use this time
CartPole-v0
An inverted pendulum: the "balance a broomstick on your hand and keep it upright" thing everyone did as a kid. The original paper is here; you probably can't read it without a subscription.
- observation : [cart position, cart velocity, pole angle, pole angular velocity]
- action : push the cart to the right or to the left
env = gym.make("CartPole-v0")
print("observation space : {}".format(env.observation_space))
print("action space : {}".format(env.action_space))

obs = env.reset()  # initialize
# env.render()  # renders the environment so you can watch it
print("initial observation : {}".format(obs))

action = env.action_space.sample()
obs, r, done, info = env.step(action)

### check what kind of values come back
print('next observation : {}'.format(obs))
print('reward : {}'.format(r))
print('done : {}'.format(done))
print('info : {}'.format(info))
Calling env.render() pops up a short animation like this. It seems to work out of the box for the environments that OpenAI Gym ships with, so give it a try if you want to see it in action.
Setting up the Agent
Now that the environment is ready, the next step is to build the agent that acts in it.
Agents implemented in ChainerRL out of the box
It covers pretty much all of the latest algorithms, and development is active, so new state-of-the-art methods will likely be incorporated as they appear.
Table of supported combinations (quoted from the README):
Algorithm | Discrete Action | Continuous Action | Recurrent Model | CPU Async Training
---|---|---|---|---
DQN (including DoubleDQN etc.) | o | o (NAF) | o | x
DDPG | x | o | o | x
A3C | o | o | o | o
ACER | o | o | o | o
NSQ (N-step Q-learning) | o | o (NAF) | o | o
PCL (Path Consistency Learning) | o | o | o | o
This time we'll use DQN, the algorithm DeepMind made famous with its Atari game-playing results.
Designing the Q-function
To use this kind of reinforcement learning we need to define the Q-function, which estimates how much reward can be expected given the current state and the action taken. In DQN and its relatives, the Q-function is approximated from the input by a neural network.
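Just to make "how much reward can be expected" concrete, here is a rough numpy sketch of the target value that (Double) DQN trains the network towards. The function name and array shapes are made up for illustration; this is not how ChainerRL computes it internally:

import numpy as np

def double_dqn_target(r, done, q_online_next, q_target_next, gamma=0.95):
    """Illustrative TD target for Double DQN (batched).

    r             : rewards received                       shape (batch,)
    done          : 1.0 where the episode ended, else 0.0  shape (batch,)
    q_online_next : online-net Q-values at the next state  shape (batch, n_actions)
    q_target_next : target-net Q-values at the next state  shape (batch, n_actions)
    """
    # Double DQN: select the greedy action with the online network ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... but evaluate that action with the (more slowly updated) target network.
    next_q = q_target_next[np.arange(len(r)), best_actions]
    # reward plus discounted future value (no bootstrapping when the episode ended)
    return r + gamma * (1.0 - done) * next_q

# tiny example: a batch of 2 transitions with 2 possible actions
r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q_online_next = np.array([[0.2, 0.5], [0.1, 0.3]])
q_target_next = np.array([[0.25, 0.45], [0.05, 0.35]])
print(double_dqn_target(r, done, q_online_next, q_target_next))  # roughly [1.4275, 0.0]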
- In ChainerRL, the Q-function can be defined as a chainer.Link
- The output is wrapped in chainerrl.action_value.DiscreteActionValue
Designing the function that maps an observation (input of dimension obs_size) to the next action (one of n_actions):
class QFunction(chainer.Chain):
    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        # super(QFunction, self).__init__(  # for Python 2.x
        super().__init__(  # for Python 3.x
            l0=L.Linear(obs_size, n_hidden_channels),
            l1=L.Linear(n_hidden_channels, n_hidden_channels),
            l2=L.Linear(n_hidden_channels, n_actions))

    def __call__(self, x, test=False):
        """
        x    : observation  # is this observation the state, the action, or both?
        test : flag for whether we are in test mode
        """
        h = F.tanh(self.l0(x))  # do we write the activation functions ourselves?
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
# q_func.to_gpu(0)  # uncomment this if you want to use a GPU
(Reference) Predefined Q-functions
It's also possible to use Q-functions that are already defined for you:
_q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    obs_size, n_actions,
    n_hidden_layers=2, n_hidden_channels=50)
Setting the optimizer and parameters
Configure the various settings needed to run the agent with DQN:
- optimizer : what to optimize with; the list of optimizers built into Chainer is here
- gamma : the reward discount factor, i.e. how strongly future rewards are taken into account
- explorer : the exploration strategy used when choosing the next action
- replay_buffer : the buffer that stores transitions for experience replay
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)  # optimize the Q-function we defined with Adam

gamma = 0.95

explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10**6)

phi = lambda x: x.astype(np.float32, copy=False)  # dtype cast (Chainer wants float32; float64 won't do)

agent = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_frequency=1,
    target_update_frequency=100, phi=phi)
Run (training)
The environment, the agent, and the DQN that updates it are now in place, so all that's left is to run it.
import time

n_episodes = 200
max_episode_len = 200
start = time.time()

for i in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0
    done = False
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while not done and t < max_episode_len:
        # uncomment this if you want to watch the agent move
        # env.render()
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
    if i % 10 == 0:
        print('episode:', i, 'R:', R, 'statistics:', agent.get_statistics())
    agent.stop_episode_and_train(obs, reward, done)

print('Finished, elapsed time : {}'.format(time.time() - start))
Testing
Training is done, so now let's actually run a test.
Since this is a test, we don't call agent.stop_episode_and_train(obs, reward, done) at the end.
for i in range(10):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        # env.render()
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()
That completes the whole flow from start to finish!! If you want to save the model, agent.save("hoge") will do it.
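For example, to keep the trained agent around and restore it in a later session, something like this should work (the directory name "dqn_cartpole_agent" is just an example):

# save the agent's parameters to a directory
agent.save("dqn_cartpole_agent")

# later: rebuild an agent with the same q_func / optimizer / replay_buffer /
# explorer settings as above, then restore the saved parameters
agent.load("dqn_cartpole_agent")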
Doing the train → test setup much more simply
Writing the training and test loops by hand every time is a pain, so for that case ChainerRL provides the function below.
This single call runs the whole thing in one go.
chainerrl.experiments.train_agent_with_evaluation(
    agent, env,
    steps=2000,           # train for 2000 steps
    eval_n_runs=10,       # run 10 evaluation (test) episodes
    max_episode_len=200,  # maximum length of each evaluation episode (200 steps)
    eval_frequency=1000,  # evaluate every 1000 training steps
    outdir='result')      # save results to the 'result' folder
Bonus 1: Try using a GPU
You can switch to the GPU with a single line: insert the line below right after defining q_func.
q_func.to_gpu(0)
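In context, this just means the Q-function setup from earlier becomes something like the following (same definitions as above, only the to_gpu line added):

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)
q_func.to_gpu(0)  # move the parameters to GPU 0 right after defining q_func

optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)  # the rest of the setup is unchanged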
Error
If you get an error like this at runtime:
OSError: Failed to run `nvcc` command. Check PATH environment variable: [Errno 2] No such file or directory: 'nvcc'
It's probably because CUDA isn't on your PATH, so check and add it. To add the path from inside Python:
import os
print(os.environ["PATH"])  # check: confirm that CUDA is missing from PATH
os.environ["PATH"] += ":/usr/local/cuda/bin/"  # the path where CUDA is installed on your machine
Results
Here's what happened when I ran this (admittedly tiny) task with and without a GPU:
 | CPU | GPU
---|---|---
Execution time | 561s | 558s
??? It seems the input dimension is just too small for parallelization to give any real speedup. Probably? If anything, the communication cost may be eating up the gains.
Bonus 2: Visualizing from obs
env.render() only works if the environment implements it, so I tried whether things can be visualized directly from the observations instead.
* Visualizing the result of one test episode
import pylab as plt
import numpy as np
import matplotlib.animation as animation

fig = plt.figure(figsize=(10, 5))
ims = []
l = 1.0  # pole length used for drawing

obs = env.reset()
R, t, done = 0, 0, False
while not done and t < 200:
    action = agent.act(obs)
    print("t:{} obs:{} action:{}".format(t, obs, action))
    im = plt.plot([-2, 2], [0, 0], color="black")
    # obs = [cart position, cart velocity, pole angle, pole angular velocity]
    im = plt.plot([obs[0], obs[0] + l * np.sin(obs[2])],
                  [0, l * np.cos(obs[2])],
                  "o-", color="blue", lw=4, label="Pole")
    ims.append(im)
    obs, r, done, _ = env.step(action)
    R += r
    t += 1
# print("test episode : {} R: {}".format(i, R))
agent.stop_episode()

plt.legend()
plt.xlim(-2.0, 2.0)
plt.ylim(-1.0, 1.0)
ani = animation.ArtistAnimation(fig, ims, interval=100)
ani.save("animation.gif", writer="imagemagick")
Visualization results
Before training
After training
It has clearly learned to balance, which is great. DQN rocks!
Closing remarks
The hard parts are wrapped up for you, so you can get things running even without fully understanding the theory! If you're not the "I want to invent new algorithms!!" type but simply want to apply reinforcement learning to something, I think this is a good place to start playing around.
Chainer's development pace is fast and new algorithms keep getting implemented one after another, which is great. OpenAI Gym also comes with all sorts of experiment environments, so I'm planning to try some other environments next!
Related
Related articles
【強化学習】強化学習系のプラットフォームが続々登場! - プロクラシスト
『これからの強化学習』という本が良さそう。遅れをとる日本のDQNを引っ張ってほしい。 - プロクラシスト
【強化学習を使いたいすべての人へ】DQNの実践&理論の勉強もできる記事の紹介(2017/1/7追記) - プロクラシスト
Related books
The classic reinforcement learning textbook, Sutton & Barto's "Reinforcement Learning: An Introduction" (Japanese translation 『強化学習』):
- Authors: Richard S. Sutton, Andrew G. Barto (translated by 三上貞芳, 皆川雅章)
- Publisher: 森北出版
- Release date: 2000/12/01
- Format: paperback (softcover)
A newer reinforcement learning book, 『これからの強化学習』, which I've also written about on this blog before; it covers everything up through DQN:
- Authors: 牧野貴樹, 澁谷長史, 白川真一 (eds.), 浅田稔, 麻生英樹, 荒井幸代, 飯間等, 伊藤真, 大倉和博, 黒江康明, 杉本徳和, 坪井祐太, 銅谷賢治, 前田新一, 松井藤五郎, 南泰浩, 宮崎和光, 目黒豊美, 森村哲郎, 森本淳, 保田俊行, 吉本潤一郎
- Publisher: 森北出版
- Release date: 2016/10/27
- Format: paperback (softcover)