Looking into what reinforcement learning is
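Hello, everyone.
How are you? I'm doing well.
Today I studied reinforcement learning and wrote up some notes.
Personally, it's a field I've been watching lately. The reason is that, unlike machine learning with a clear-cut goal (classification, regression), it looks like it can solve a wide range of problems in a general-purpose way.
(That wording probably invites all sorts of misunderstanding.)
If anything here is wrong, please let me know.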
Reinforcement learning
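Reinforcement learning is "a kind of machine learning that deals with the problem of an agent in an environment observing the current state and deciding which action to take" (by Wikipedia).
The key terms that show up here are probably "state" and "deciding an action".
In other words, the problem to solve is: "in a given state, how should the agent act?"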
Elements of reinforcement learning
There are four things to think about in reinforcement learning.
① Policy ... how the agent acts
② Reward function ... the function that defines the goal of the reinforcement learning problem
③ Value function ... a measure of how good things are over the long term
④ Model of the environment ... the definition of the actions and states
Policy
The rule that, given a state, determines how to act based on the observed state.
The simplest approach is a lookup table (a table that fixes "in this state, do this" for every state).
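As a rough illustration of the lookup-table idea, here is a minimal sketch. The state names, action names, and the tiny "corridor" world are all made up for this example and are not from any particular library.

```python
# Hypothetical states and actions for a tiny corridor world (assumed for illustration).
POLICY_TABLE = {
    "left_end": "move_right",
    "middle": "move_right",
    "right_end": "stay",
}


def policy(state):
    """Look up the action the table prescribes for the observed state."""
    return POLICY_TABLE[state]


if __name__ == "__main__":
    for s in ("left_end", "middle", "right_end"):
        print(s, "->", policy(s))
```

A table like this only works when the states can be enumerated; for larger problems the table is usually replaced by some function of the state.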
Reward Function
The part that defines the goal when doing reinforcement learning.
It tells the agent whether an event was "good" or "bad".
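To make "telling the agent good or bad" a bit more concrete, here is a small sketch of a reward function. The goal state name and the reward values (+1.0 for reaching the goal, -0.1 per other step) are assumptions chosen for illustration only.

```python
# Hypothetical goal state and reward values (assumed for illustration).
GOAL_STATE = "right_end"


def reward(state, action, next_state):
    """Return a number telling the agent how good the event was.

    Reaching the goal is "good" (+1.0); every other step is mildly "bad" (-0.1),
    which nudges the agent toward reaching the goal quickly.
    """
    if next_state == GOAL_STATE:
        return 1.0
    return -0.1


if __name__ == "__main__":
    print(reward("middle", "move_right", "right_end"))
    print(reward("left_end", "move_right", "middle"))
```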
Value function
The value function is a measure of whether behavior is good or bad in the long run.
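One common way to express "long run" (the one used in Sutton and Barto's book) is the discounted sum of future rewards; the value of a state is the expected value of that sum. The sketch below only computes the sum for a single, already observed sequence of rewards. The function name and the default discount factor of 0.9 are assumptions for this example.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum the rewards, weighting the reward t steps ahead by gamma**t."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # A small reward now followed by a large penalty is bad in the long run.
    print(discounted_return([1.0, -10.0]))
    # No reward at first, but a payoff later, is good in the long run.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

This is why the reward and the value can disagree: an action with a good immediate reward can still lead to a low value.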
model of the environment
This is, so to speak, the problem you want to solve: in what form can "a state" and the actions available in it be defined?
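As a sketch of what "defining the states and actions" might look like, here is a toy environment: a corridor of four cells where the agent moves left or right and the rightmost cell is the goal. The corridor, the constant names, and the `step` function are all assumptions for this example, not an existing library's API.

```python
# A hypothetical 1-D corridor: states are cell indices 0..3, cell 3 is the goal.
N_CELLS = 4
ACTIONS = ("left", "right")


def step(state, action):
    """Return (next_state, reward, done) for taking `action` in `state`."""
    if action == "right":
        next_state = min(state + 1, N_CELLS - 1)  # walls stop the agent
    elif action == "left":
        next_state = max(state - 1, 0)
    else:
        raise ValueError(f"unknown action: {action}")
    done = next_state == N_CELLS - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


if __name__ == "__main__":
    state = 0
    for action in ("right", "right", "right"):
        state, reward, done = step(state, action)
        print(state, reward, done)
```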
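Put together as a picture, it would look roughly like this: the agent observes the state of the environment, chooses an action by its policy, and the environment returns a reward and the next state.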
Problems reinforcement learning can solve
Reinforcement learning can solve problems like the following.
Optimizing robot motion
Reinforcement learning - Google Search
Solving a maze
http://qiita.com/hogefugabar/items/74bed2851a84e978b61c
Alpha Go
AlphaGo - Wikipedia
In other words, it seems to handle problems that have a state, have a goal, and can be solved by trial and error.
The n-Armed Bandit Problem is also included, and I'd like to keep studying that area.
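For reference, here is a minimal sketch of the n-armed bandit setting using epsilon-greedy action selection, one of the simple methods covered in Sutton and Barto's book. The arm means, the unit-variance Gaussian rewards, the step count, and the function name are assumptions for this example.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Play an n-armed bandit with epsilon-greedy selection.

    Returns the estimated mean reward of each arm and how often each was pulled.
    """
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n
    counts = [0] * n
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: pick a random arm
        else:
            arm = max(range(n), key=lambda a: estimates[a])  # exploit
        r = rng.gauss(true_means[arm], 1.0)  # noisy reward (assumed Gaussian)
        counts[arm] += 1
        # Incremental sample average of the rewards seen for this arm.
        estimates[arm] += (r - estimates[arm]) / counts[arm]
    return estimates, counts


if __name__ == "__main__":
    estimates, counts = run_bandit([0.0, 1.0, 3.0])
    print("estimates:", estimates)
    print("pull counts:", counts)
```

There is only one state here, so the trial and error is purely about finding which arm pays best.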
References
Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction"