Feature Importance Computed by Random Forest
1. Background
Random Forest [1] is a machine learning method that exploits the benefits of randomness to efficiently train a large number of decision trees. Compared with existing methods such as SVM, it has several advantages: feature importance can be computed as part of training, training is fast, and it is resistant to overfitting (see appended note 1). It is also said to be used for the pose estimation in Kinect.
Lately Random Forest gets used rather casually (especially in my lab), and I suspect many people do not fully understand some of its parameters and outputs. For basic usage, reading TJO's write-up [2] should be enough, and for the details I would point to Habe-sensei's tutorial [3].
Even so, however many Japanese resources you read, there is almost no material on the details of feature importance, one of Random Forest's distinctive features.
So in this article I will dig into feature importance.
2. Overview of feature importance
In the original implementation and in R's Random Forest, there are mainly two ways to compute feature importance. (scikit-learn seems to do things a bit differently, so I may write about it if there is demand.)
i) Importance from perturbing a feature (MeanDecreaseAccuracy)
ii) Importance from the Gini coefficient (MeanDecreaseGini)
In R, you can compute them by training with importance enabled, e.g. rf <- randomForest(x, y, importance = TRUE), and then reading the table back with importance(rf) (or plotting it with varImpPlot(rf)).
Running this on a sample data set gives a table with one row per variable and columns such as MeanDecreaseAccuracy and MeanDecreaseGini.
Basically, the higher the value, the more useful that variable is considered to be for classification.
Incidentally, scikit-learn appears to use only the Gini-based importance by default. [11]
3. Some Random Forest background
Before getting into feature importance, let me first describe the properties of Random Forest that the importance measures build on.
In Random Forest, each decision tree is trained on a different sample.
To achieve this, as many points as the training set contains are drawn from it with replacement (the bootstrap method), and each decision tree is trained on that resample. In the process, on average about 1/3 of the training data is never drawn for a given tree. This unused portion is called the out-of-bag (OOB) data. Afterwards, to gauge a tree's performance, the OOB data is classified by the tree that was trained without it and the misclassification rate is measured; this is called the OOB error.
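A quick way to convince yourself of the "about 1/3" figure is to simulate the bootstrap. This is a minimal Python sketch of my own, independent of any Random Forest library:

```python
import random

random.seed(0)

n, trials = 1000, 100
oob_fracs = []
for _ in range(trials):
    # Bootstrap: draw n indices with replacement from n training points.
    bag = {random.randrange(n) for _ in range(n)}
    # Points that were never drawn are out-of-bag (OOB) for this tree.
    oob_fracs.append(1 - len(bag) / n)

avg_oob = sum(oob_fracs) / trials
print(avg_oob)  # close to 1/e ≈ 0.368
```

The expected fraction left out is (1 − 1/n)^n, which tends to 1/e ≈ 0.368, i.e. the "about 1/3" above.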
Figure: training the decision trees (sampling and training are repeated for each tree)
4. Importance from perturbing a feature
Now let us use the OOB data obtained during training to measure feature importance. The basic idea is: if we scramble the values of the OOB samples and then run the prediction on those scrambled samples, how much does the accuracy drop?
To keep the concrete explanation easy to follow, first consider a single trained decision tree and have it classify its own OOB data. Some of the OOB samples will then be misclassified; here I will call that fraction the OOB misclassification rate. The figure below illustrates it. A caveat for readers who know Random Forest well: the OOB error estimate (the OOB error above) and this OOB misclassification rate are different things.
Figure: definition of the OOB misclassification rate
We use this OOB misclassification rate to compute feature importance. Pick one of the feature variables and randomly permute its values across the OOB samples. In a figure, it looks like this:
Figure: how the permutation-based importance is computed
Afterwards, the OOB misclassification rate is computed again on the permuted data, and how much the accuracy dropped is used as the measure.
Addendum: this is apparently done for every tree, and the average over the trees is taken [10]. That average becomes the feature importance.
Note, however, that the value R outputs does not seem to be this raw value. (See the appendix for the original description.)
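The whole procedure can be sketched end to end with a toy forest. This is my own minimal Python illustration, not the randomForest implementation: the "trees" are single-split stumps, feature 0 determines the label exactly, and feature 1 is pure noise, so permuting feature 0 on the OOB data should raise the error while permuting feature 1 should not.

```python
import random
from statistics import mean

random.seed(0)

# Toy data: feature 0 determines the label exactly, feature 1 is noise.
X = [[i % 2, random.random()] for i in range(60)]
y = [row[0] for row in X]
n = len(X)

def train_stump(Xs, ys):
    """Pick the (feature, threshold) pair with the best training accuracy."""
    best = None
    for f in range(2):
        for t in sorted({row[f] for row in Xs}):
            acc = mean((1 if row[f] >= t else 0) == yy for row, yy in zip(Xs, ys))
            if best is None or acc > best[0]:
                best = (acc, f, t)
    _, f, t = best
    return lambda row: 1 if row[f] >= t else 0

increases = [[], []]  # per-feature OOB error increases, one entry per tree
for _ in range(25):
    bag = [random.randrange(n) for _ in range(n)]   # bootstrap sample
    bagset = set(bag)
    oob = [i for i in range(n) if i not in bagset]  # out-of-bag indices
    tree = train_stump([X[i] for i in bag], [y[i] for i in bag])
    base = mean(tree(X[i]) != y[i] for i in oob)    # OOB misclassification rate
    for f in range(2):
        # Permute feature f's values among the OOB samples only.
        vals = [X[i][f] for i in oob]
        random.shuffle(vals)
        err = mean(tree([v if k == f else X[i][k] for k in range(2)]) != y[i]
                   for i, v in zip(oob, vals))
        increases[f].append(err - base)

# Average error increase over the trees = permutation importance.
importance = [mean(d) for d in increases]
print(importance)  # importance[0] should come out clearly larger than importance[1]
```

Permuting the informative feature roughly halves the OOB accuracy of each stump, while permuting the noise feature changes nothing, which is exactly the ranking MeanDecreaseAccuracy is meant to capture.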
5. Importance from the Gini coefficient
In Random Forest, a large number of decision trees learn the patterns before any classification is done.
When building each tree, after randomly selecting a subset of the explanatory variables (the feature variables), one would ideally choose the variable and threshold that split the samples best, i.e. that maximize the information gain (minimize the entropy of the resulting child nodes).
However, that approach is covered by patents (cf. See5/C5.0 and friends), so it cannot be released as-is.
Instead, the Gini coefficient is generally used as a substitute (cf. CART).
The Gini coefficient I_G resembles entropy and is defined as follows [5]:

I_G = Σ_i f_i (1 − f_i) = 1 − Σ_i f_i²

(where f_i is the fraction of samples in the region that carry label i; I_G is 0 when every sample in the region has the same label)
As with entropy, the larger the Gini coefficient, the more mixed the split result, and the smaller it is, the more homogeneous the result.
Consequently, how much the Gini coefficient drops when the samples are split on a given variable tells us how important that variable is for classification.
The accumulated decrease in the Gini coefficient (MeanDecreaseGini) can therefore be used as an approximate expression of feature importance.
For CART and the Gini coefficient, the following sites are good references.
Addendum: http://tjo.hatenablog.com/entry/2013/11/21/194654
http://www.teradata-j.com/library/ma/ins_1314a.html
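The two quantities involved, the Gini impurity of a node and its decrease at a split, are easy to compute by hand. Here is a minimal Python sketch of the definition in [5] (my own toy code, not R's internals):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: I_G = 1 - sum_i f_i^2 (0 when the node is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Weighted impurity decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

mixed = ["a", "a", "b", "b"]
print(gini(mixed))                                   # 0.5: maximally mixed for two classes
print(gini(["a", "a", "a"]))                         # 0.0: pure node
print(gini_decrease(mixed, ["a", "a"], ["b", "b"]))  # 0.5: a perfect split removes all impurity
```

MeanDecreaseGini is then obtained by accumulating such decreases for each variable over all splits and all trees.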
6. Comparing with an existing method (anticlimax)
Since I was at it, I decided to use R's {FSelector} package to see how its feature selection compares.
...or so I thought, but a mysterious error occurred while loading FSelector, so that will have to wait for another time. For the record, FSelector's random.forest.importance() performs feature selection based on Random Forest.
Addendum (2015/05/05):
Someone has actually tried this, so I am quoting their results here.
7. Closing remarks
Since Random Forest can be trained on a cluster (cf. Mahout), I expect it to be valued even more from now on. When you use the feature importances, knowing what they actually mean should make your discussions easier. I hope this serves as a reference, and feel free to leave comments.
References
[1] http://oz.berkeley.edu/~breiman/randomforest2001.pdf
[2] http://tjo.hatenablog.com/entry/2013/12/24/190000
[3] http://www.habe-lab.org/habe/RFtutorial/CVIM_RFtutorial.pdf
[4] http://penglab.janelia.org/proj/mRMR/
[5] http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
[10] http://www.stat.berkeley.edu/~breiman/RandomForests/ENAR.htm
[11] From the comments on the DecisionTreeClassifier class in scikit-learn's tree.py: https://github.com/scikit-learn/scikit-learn/blob/fed4692e4014ef056a63abed67748813b6da8a8a/sklearn/tree/tree.py
(Appended note 1) The author of Random Forest appears to claim that increasing the number of trees does not cause overfitting [10]. With more trees there are more bootstrap resamples, so the chance that outliers dominate any single resample goes down, which presumably makes the model less sensitive to them. In that respect it resembles RANSAC. Overfitting can also be kept in check by adjusting the depth of the trees.
Aside
The OOB error is not really the topic of this article, so this is just an aside, but: I have been using Random Forest for about a year, and when I compared the OOB error against leave-one-out cross-validation, there were cases where the OOB error came out about 10% better in recognition accuracy. I can't shake the feeling that such a number is too good to be true.
Appendix
The passage from the original author describing MeanDecreaseAccuracy, quoted below [1]. (I wish the author had put this in a figure.)
"Suppose there are M input variables. After each tree is constructed, the values of the mth variable in the out-of-bag examples are randomly permuted and the out-of-bag data is run down the corresponding tree. The classification given for each xn that is out of bag is saved. This is repeated for m=1,2, ... , M. At the end of the run, the plurality of out-of-bag class votes for xn with the mth variable noised up is compared with the true class label of xn to give a misclassification rate."
"The output is the percent increase in misclassification rate as compared to the out-of-bag rate (with all variables intact)."