(Note: because of a bug in Hatena Fotolife, the images may be displayed in the wrong order.)
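Well, making an article like this my Christmas Eve present is a bit much (lol), but for supervised learning and classifiers this is planned to be the last installment of the series for now.
Closing out the show is random forest, the star player of ensemble learning. Apparently plenty of people go around saying "random forest is the strongest" *1, and those are exactly the people I'd like to read this post (and the planned next one, a roundup of all five installments).
The reference for this post is once again the thin pink book. Pages 193-197 cover decision trees, bagging, and AdaBoost, followed by an explanation of random forests.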
はじめてのパターン認識 (Hajimete no Pattern Ninshiki, "Hajipata" for short)
- Author: Yuzo Hirai (平井有三)
- Publisher: Morikita Publishing (森北出版)
- Release date: 2012/07/31
- Format: Book (softcover)
Elsewhere, PRML, for example, also covers random forests, though in a terribly long-winded way. Sugiyama's book is also a good choice, I think.
- Author: Masashi Sugiyama (杉山将)
- Publisher: Kodansha (講談社)
- Release date: 2013/09/18
- Format: Book (softcover)
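Rendering "random forest" as 「ランダム森」 ("random woods") is painful (lol), but in my opinion this book has the most concise explanation. Pages 93-99 go from decision trees *2 into random forests, and that flow is easy to follow. Come to think of it, the first installment of this series was on decision trees, so have a look at that one as well.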
First, let's see what it looks like in R
As usual, the aim is just to get a feel for random forests, so as in the previous posts I'll use the earlier sample data hosted on GitHub. By now this is a very familiar dataset, prepared around the theme of "finding the actions (a1-a7) that drive conversion (CV)". Import it under the name d.
In R, random forest basically means the {randomForest} package. Install it and try the following. In an earlier post ("If you do machine learning in R, make tuning easy too with grid search functions or options") I mentioned that you can tune it with the tuneRF() function, so let's do that while we're at it.
> require("randomForest")
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> tuneRF(d[,-8],d[,8],doBest=T)
mtry = 2  OOB error = 6.43%
Searching left ...
mtry = 1  OOB error = 9.23%
-0.4352332 0.05
Searching right ...
mtry = 4  OOB error = 6.6%
-0.02590674 0.05

Call:
 randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6.4%
Confusion matrix:
      No  Yes class.error
No  1399  101  0.06733333
Yes   91 1409  0.06066667
# Tune first: mtry=2 looks like the best value
That's how the tuning results come out. mtry=2 means that trying two randomly chosen features at each split works best, so we pass that value to the randomForest() function as an argument.
> d.rf<-randomForest(cv~.,d,mtry=2)
# Classify with randomForest(), passing mtry=2 as an argument
> print(d.rf)

Call:
 randomForest(formula = cv ~ ., data = d, mtry = 2)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6.37%
Confusion matrix:
      No  Yes class.error
No  1403   97  0.06466667
Yes   94 1406  0.06266667
# The OOB error is 6.37%, which is decent
> importance(d.rf)
   MeanDecreaseGini
a1        20.320854
a2        11.490523
a3         2.380128
a4       203.135651
a5        75.415005
a6       783.553501
a7         2.679649
# As with decision trees, you get variable importance, and this matters quite a lot
> table(d$cv,predict(d.rf,d[,-8]))

        No  Yes
  No  1409   91
  Yes   83 1417
# Classification accuracy is 94.2%, which is decent
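This 94.2% classification accuracy (computed on the same data the model was trained on) beats the figures for the SVM and the neural network covered in earlier posts. I think it's a number that shows just how capable random forest is as a machine learning classifier.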
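To make the arithmetic behind those percentages explicit, here is a small sketch that recomputes them from the two confusion matrices printed above. The helper name accuracy() is my own, not part of {randomForest}; the counts are copied from the output.

```r
# Training-data confusion matrix from table(d$cv, predict(d.rf, d[,-8]))
# (rows = actual, columns = predicted; matrix() fills column by column)
train_cm <- matrix(c(1409, 83, 91, 1417), nrow = 2,
                   dimnames = list(actual = c("No", "Yes"),
                                   predicted = c("No", "Yes")))

# OOB confusion matrix from print(d.rf)
oob_cm <- matrix(c(1403, 94, 97, 1406), nrow = 2,
                 dimnames = dimnames(train_cm))

# Hypothetical helper: correct predictions divided by all predictions
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

train_acc <- accuracy(train_cm)  # (1409 + 1417) / 3000
oob_acc   <- accuracy(oob_cm)    # (1403 + 1406) / 3000
print(round(c(train = train_acc, oob = oob_acc), 4))
```

The training-data figure is the 94.2% quoted above, and the OOB figure is just 1 minus the 6.37% OOB error, so the two differ by a little over half a point on this dataset.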
So what is a random forest?
Put simply, it's "bagging with decision trees as the weak learners". For bagging, quoting the explanation in Sugiyama's book is probably the clearest and quickest route, so here is an excerpt from p.97.
① Repeat the following for each weak learner.
(a) From the training samples, choose samples at random, allowing duplicates.
(b) Using the samples obtained this way, fit a weak learner.
② Take the average of all the weak learners as the strong learner.
Drawn as a rough conceptual diagram, it looks something like this.
In other words, the crux is making clever use of the bootstrap: build lots of computationally cheap weak learners from bootstrap samples and, in true bootstrap fashion, average them, to get a strong learner that is expected to be empirically accurate. That's the concept.
And random forest is what you get when you pick decision trees *3, which really are computationally cheap, as those weak learners.
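The bagging recipe above can be sketched in a few lines of base R. This is only an illustration under my own assumptions: the toy data are made up, the weak learner is a hand-rolled decision stump (a one-split tree) rather than a full decision tree, the bootstrap sample size equals the data size, and the names fit_stump() and predict_stump() are hypothetical.

```r
set.seed(1)

# Toy two-class data (made up for this sketch): the label depends on x1 only
n <- 200
toy <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
toy$y <- as.integer(toy$x1 + rnorm(n, sd = 0.3) > 0)

# Weak learner: a decision stump that picks the feature, threshold, and
# direction with the lowest training error
fit_stump <- function(df, features = c("x1", "x2")) {
  best <- list(err = Inf)
  for (f in features) {
    for (th in unique(df[[f]])) {
      pred <- as.integer(df[[f]] > th)
      for (flip in c(FALSE, TRUE)) {
        p <- if (flip) 1L - pred else pred
        err <- mean(p != df$y)
        if (err < best$err) best <- list(err = err, f = f, th = th, flip = flip)
      }
    }
  }
  best
}

predict_stump <- function(s, df) {
  pred <- as.integer(df[[s$f]] > s$th)
  if (s$flip) 1L - pred else pred
}

# (a) draw a bootstrap sample, (b) fit a weak learner on it, repeated B times
B <- 25
stumps <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  fit_stump(toy[idx, ])
})

# Average the weak learners; thresholding the average at 0.5 is a majority vote
votes   <- sapply(stumps, predict_stump, df = toy)  # n x B matrix of 0/1
bagged  <- as.integer(rowMeans(votes) > 0.5)
bag_acc <- mean(bagged == toy$y)
print(bag_acc)
```

B is odd here so that the vote can never tie. Swapping the stump for a real decision tree gives plain bagged trees, which is the starting point for random forest.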
However, with plain bagging the individual decision trees end up highly correlated with one another, which drags classification accuracy down. Random forest therefore uses only a predetermined number of randomly selected features for discrimination (this is the mtry argument in the {randomForest} package), so that it can generate diverse decision trees with low correlation (Hajipata p.193). With the {randomForest} package, the tuneRF() function can search for the best value of mtry.
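As a rough illustration of what "randomly selecting a fixed number of features" means, here is a base R sketch. The feature names follow the a1-a7 columns of the sample data; the two formulas are the usual rule-of-thumb defaults (to my knowledge also the {randomForest} defaults), and the variable names are my own.

```r
set.seed(42)

features <- paste0("a", 1:7)  # the seven action variables a1-a7
p <- length(features)

# Rule-of-thumb number of features tried at each split
mtry_classification <- floor(sqrt(p))        # 2 when p = 7
mtry_regression     <- max(floor(p / 3), 1)  # 2 when p = 7

# Every split considers a fresh random subset of mtry features
candidates_per_split <- replicate(
  3, sort(sample(features, mtry_classification)), simplify = FALSE)
print(candidates_per_split)
```

With seven predictors the classification rule gives 2, which happens to match the value tuneRF() picked above.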
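As for the algorithm itself, here is how Wikipedia, for example, describes it.
Algorithm
Training
- From the observed data to be learned, generate B sets of subsamples by random sampling (bootstrap samples)
- Using each subsample as training data, build B decision trees
  - Until the specified number of nodes is reached, create nodes as follows
    - Randomly select m of the explanatory variables in the training data
    - Among the selected explanatory variables, take the one that best classifies the training data, together with its threshold, and use them to determine the node's split function
The key point is that using randomly sampled training data and randomly selected explanatory variables produces a group of decision trees with low correlation.
Recommended parameter values
- Specified node size: 1 for classification, 5 for regression
- m: with p the total number of explanatory variables, sqrt(p) for classification and p/3 for regression
Evaluation
The final output is determined as follows
- Classification: if the trees output classes, their majority vote; if they output probability distributions, the class whose averaged value is largest
- Regression: the average of the trees' outputs
(wikipedia:Random_forest)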
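The two classification rules in the "Evaluation" part are easy to mix up, so here is a tiny base R sketch of both. All numbers are made up: five hypothetical trees scoring three samples.

```r
# Case 1: each tree outputs a class label (one column per tree)
tree_labels <- matrix(c("No",  "Yes", "Yes",   # tree1: s1, s2, s3
                        "No",  "Yes", "No",    # tree2
                        "Yes", "Yes", "No",    # tree3
                        "No",  "No",  "No",    # tree4
                        "No",  "Yes", "Yes"),  # tree5
                      nrow = 3,
                      dimnames = list(paste0("s", 1:3), paste0("tree", 1:5)))

# Majority vote across trees for each sample
majority <- apply(tree_labels, 1, function(v) names(which.max(table(v))))

# Case 2: each tree outputs P(class = "Yes"); average, then take the larger class
tree_prob_yes <- matrix(c(0.2, 0.9, 0.6,
                          0.4, 0.8, 0.3,
                          0.7, 0.6, 0.4,
                          0.1, 0.4, 0.2,
                          0.3, 0.7, 0.6), nrow = 3)
mean_prob_yes <- rowMeans(tree_prob_yes)
by_prob <- ifelse(mean_prob_yes > 0.5, "Yes", "No")

print(majority)
print(by_prob)
```

For regression the rule is simply rowMeans() over the trees' numeric outputs.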
Incidentally, Leo Breiman, the inventor himself, has published a web page explaining the method, so if reading English doesn't bother you, that's worth a read.
Of course, coding it up from scratch while following the requirements listed there would also be good practice. I can't be bothered, so I won't (lol).
Random forests can also be evaluated with the Out-Of-Bag (OOB) error rate, which can be seen as a kind of cross validation. This is also covered in Hajipata pp.194-195; in short, it is what you get when, "for a given training sample, you gather the decision trees that did not use that sample into a sub-forest, classify the sample with it as test data, and compute the error rate". By the way, the {randomForest} package reports the OOB error rate in the classification case, but in the regression case it reports % variance explained instead.
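To get a feel for how much data is "out of bag", here is a base R simulation. It assumes each tree draws a bootstrap sample of the same size as the data; n matches the 3,000 rows of the sample data, while B = 200 is an arbitrary choice of mine.

```r
set.seed(123)
n <- 3000  # same size as the sample data above
B <- 200   # number of trees (arbitrary for this sketch)

# inbag[i, b] is TRUE when sample i was drawn into tree b's bootstrap sample
inbag <- sapply(seq_len(B), function(b) {
  seq_len(n) %in% sample(n, n, replace = TRUE)
})

# Share of samples left out of a bootstrap sample, close to exp(-1)
oob_fraction <- mean(!inbag)

# For each sample, the number of trees that never saw it; only those trees
# vote when that sample's OOB prediction is made
oob_trees_per_sample <- rowSums(!inbag)

print(c(oob_fraction = oob_fraction, expected = exp(-1)))
print(range(oob_trees_per_sample))
```

So each sample is held out from roughly a third of the trees, which is why the OOB error behaves like a built-in hold-out estimate.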
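And since the principle is exactly the same as for decision trees, variable importance *4 can be defined for both classification and regression. With the {randomForest} package you can check it with the importance() function and plot it with the varImpPlot() function (partialPlot() shows the partial dependence on an individual variable). I suspect plenty of people use random forests in ad-hoc analyses with exactly that in mind. There is also a good resource on variable importance here.
By the way, compared with my slapdash explanation of the principle, this article explains how random forests work far more clearly, so I recommend it.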
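The MeanDecreaseGini column shown by importance() is built from Gini impurity decreases at the splits. Here is a worked base R example of one such decrease; the data and the helper names gini() and gini_decrease() are made up for illustration, and the scale is not the same as the package's output.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease from splitting on x <= threshold, weighting each child
# node by its share of the samples
gini_decrease <- function(x, y, threshold) {
  left <- x <= threshold
  n <- length(y)
  gini(y) - (sum(left) / n) * gini(y[left]) - (sum(!left) / n) * gini(y[!left])
}

# Made-up data: x_good separates the classes perfectly, x_noise does not
y       <- c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes")
x_good  <- c(1, 2, 3, 4, 5, 6, 7, 8)
x_noise <- c(1, 5, 2, 6, 3, 7, 4, 8)

dec_good  <- gini_decrease(x_good,  y, 4.5)  # both children are pure
dec_noise <- gini_decrease(x_noise, y, 4.5)  # children are as mixed as the parent
print(c(good = dec_good, noise = dec_noise))
```

A variable that keeps producing large decreases across many trees ends up with a large importance, which is the pattern a6 and a4 show in the output above.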
Drawing the decision boundary
Now let's do the usual. Being able to separate classes nonlinearly is one of random forest's strengths, so as before we'll have it classify XOR patterns. As always, first fetch the simple and the complex versions of the XOR pattern from GitHub and import them as xors and xorc respectively.
By the way, you can tune on this data as well, but it turned out that mtry=1 and mtry=2 are equally fine.
> require("randomForest")
Loading required package: randomForest
randomForest 4.6-7
Type rfNews() to see new features/changes/bug fixes.
> xors$label<-as.factor(xors$label-1)
> xorc$label<-as.factor(xorc$label-1)
# Recode label to 0/1
> tuneRF(xors[,-3],xors[,3],doBest=T)
mtry = 1  OOB error = 3%
Searching left ...
Searching right ...
mtry = 2  OOB error = 3%
0 0.05

Call:
 randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 3%
Confusion matrix:
   0  1 class.error
0 48  2        0.04
1  1 49        0.02
> tuneRF(xorc[,-3],xorc[,3],doBest=T)
mtry = 1  OOB error = 34%
Searching left ...
Searching right ...
mtry = 2  OOB error = 34%
0 0.05

Call:
 randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 32%
Confusion matrix:
   0  1 class.error
0 33 17        0.34
1 15 35        0.30
> xors.rf<-randomForest(label~.,xors)
> xorc.rf<-randomForest(label~.,xorc)
# Classify with plain randomForest()
> px<-seq(-3,3,0.03)
> py<-seq(-3,3,0.03)
> pgrid<-expand.grid(px,py)
> names(pgrid)<-c("x","y")
# Build a grid for drawing the decision boundary
> out.xors.rf<-as.numeric(as.character(predict(xors.rf,newdata=pgrid)))
> out.xorc.rf<-as.numeric(as.character(predict(xorc.rf,newdata=pgrid)))
# Predict the label (0/1) at every grid point
> plot(xors[1:50,-3],col="blue",pch=19,cex=3,xlim=c(-3,3),ylim=c(-3,3))
> points(xors[51:100,-3],col="red",pch=19,cex=3)
> par(new=T)
> contour(px,py,array(out.xors.rf,dim=c(length(px),length(py))),xlim=c(-3,3),ylim=c(-3,3),col="purple",lwd=3,drawlabels=F,levels=0.5)
# Draw the decision boundary for the simple pattern
> plot(xorc[1:50,-3],col="blue",pch=19,cex=3,xlim=c(-3,3),ylim=c(-3,3))
> points(xorc[51:100,-3],col="red",pch=19,cex=3)
> par(new=T)
> contour(px,py,array(out.xorc.rf,dim=c(length(px),length(py))),xlim=c(-3,3),ylim=c(-3,3),col="purple",lwd=3,drawlabels=F,levels=0.5)
# Draw the decision boundary for the complex pattern
> table(xors$label,predict(xors.rf,xors[,-3]))

     0  1
  0 50  0
  1  0 50
# Classification accuracy on the simple pattern is 100%
> table(xorc$label,predict(xorc.rf,xorc[,-3]))

     0  1
  0 50  0
  1  0 50
# Classification accuracy on the complex pattern is 100% as well
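That's the simple pattern, and
this is the complex pattern. Incidentally, the markers are so big here that you can't tell what's going on, so with smaller markers it looks like this.
The former is one thing, but the latter looks pretty thoroughly overfit to me, without even a hint of generalization (lol). Well, that's roughly how it goes. In any case, this actually gave the highest classification accuracy on the training data (100% for both patterns) of all the machine learning classifiers covered so far.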
Note that in randomForest() the ntree argument sets the number of decision trees generated for the forest *5, and the plot.randomForest() function lets you see how the OOB error rate changes with ntree.
> plot(xors.rf)
> plot(xorc.rf)
With the simple pattern the OOB error rate converges right away, but with the complex pattern it seems to wobble around a bit... Maybe it really is overfitting.
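If you want the numbers behind that plot rather than the picture, the fitted object carries them. A minimal sketch, assuming {randomForest} is installed as in the session above; the built-in iris data is only a stand-in for xors/xorc, and ntree = 200 is an arbitrary choice of mine.

```r
library(randomForest)  # assumed to be installed, as in the session above

set.seed(1)
# iris is only a stand-in dataset for this sketch
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

# err.rate has one row per number of trees; the "OOB" column is the
# OOB error of the forest built from the first k trees
oob_curve <- rf$err.rate[, "OOB"]
print(length(oob_curve))   # one value per tree
print(tail(oob_curve, 1))  # OOB error of the full forest

# Base-graphics version of the OOB curve that plot(rf) draws
plot(seq_along(oob_curve), oob_curve, type = "l",
     xlab = "number of trees", ylab = "OOB error rate")
```

If the curve is still drifting at the right edge, raising ntree is the cheap first thing to try before concluding anything about overfitting.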
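In closing
Random forest is a very powerful classification method, but even though in theory it is said to generalize in the way the OOB error suggests, I think you need to keep in mind that, as the decision boundary example shows, it may fail to generalize depending on the training data. After all, as long as it is a collection of decision trees, it can only ever draw decision boundaries that are (at least piecewise) "parallel to the axes".
In that sense, evaluation by OOB error rate matters, but I suspect evaluating performance by cross validation on real data matters even more. And if generalization is the priority, you're probably better off with an SVM or some other low-variance model. That's my honest impression.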
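For reference, here is roughly what a k-fold cross validation loop looks like in base R. Everything in it is an assumption for illustration: the toy data are made up, k = 5 is arbitrary, and logistic regression via glm() stands in as a placeholder classifier so the sketch runs without extra packages.

```r
set.seed(7)

# Toy data (made up): two overlapping Gaussian blobs
n <- 200
toy <- data.frame(x = c(rnorm(n / 2, -1), rnorm(n / 2, 1)),
                  y = c(rnorm(n / 2, -1), rnorm(n / 2, 1)),
                  label = factor(rep(c(0, 1), each = n / 2)))

# Assign every row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Placeholder classifier: logistic regression. To evaluate a random forest,
# swap in randomForest(label ~ ., train) and predict(model, test).
cv_acc <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  model <- glm(label ~ x + y, data = train, family = binomial)
  pred  <- as.integer(predict(model, newdata = test, type = "response") > 0.5)
  mean(pred == as.integer(as.character(test$label)))
})

print(round(cv_acc, 3))
print(mean(cv_acc))
```

The point of the loop is that every accuracy is measured on rows the model never saw, unlike the 100% figures above, which were computed on the training data.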
Since it's Christmas Eve today
I drew a Christmas tree with a random forest!
Whoa, it's all jagged (lol). That was too awful, so I redrew it with an SVM.
Hmm, somehow kind of underwhelming... And with that, Merry Christmas!
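*1: See here → http://d.hatena.ne.jp/shakezo/20130715/1373874047
*2: For some reason the book uses the term 「決定株」 (decision stump) for these, which makes things a bit awkward
*3: As Sugiyama's book also notes on p.95, the algorithm just picks one feature to look at, sorts on it, picks the cut with the smallest value of the separation-error measure, and splits, over and over, so the computational load is light no matter what
*4: Plays the same role as the partial regression coefficients in regression analysis
*5: The default is ntree = 500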