I did a bit of searching and found surprisingly few articles on the topic in the title, so I decided to write one up myself. This post is intended as an homage to id:shakezo's article — in fact, it was only after reading that piece that I started tuning more actively, figuring that R surely has dedicated functions for this sort of thing and that it should be easy to do (laughs).
General discussion: why does machine learning need tuning in the first place?
Every machine learning method comes with tuning parameters of some kind. For example, a soft-margin SVM has the margin parameter C; a nonlinear Gaussian-kernel SVM additionally has the kernel parameter σ; and if you use the SMO (sequential minimal optimization) algorithm, a tolerance parameter comes in on top of that.
But as you quickly find out once you try it, just changing these tuning parameters makes the results of machine learning swing wildly. If you draw the separating hyperplane for, say, a 2-dimensional SVM, it changes completely just by varying C and σ.
The two figures below are runs of a nonlinear Gaussian-kernel SVM*1 based on the SMO algorithm that I once wrote in Matlab for self-study. Drawing the separating hyperplane as a contour plot for different combinations of C and σ, on almost the same training data, gives results this different: the upper one generalizes nicely, while the lower one is clearly overfitting.
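The same effect is easy to see in R as well. The sketch below is not the original Matlab demo — it uses svm() from {e1071} (which appears later in this post) on made-up 2-D data, purely to illustrate how strongly cost (the C above) and gamma (the RBF counterpart of σ) move the boundary:

```r
# Not the original Matlab demo: a rough substitute using {e1071}
# on synthetic 2-D data, to show how much the boundary moves.
library(e1071)

set.seed(71)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- factor(ifelse(d$x1^2 + d$x2^2 + rnorm(200, sd = 0.5) > 1.5, "Yes", "No"))

# Two (cost, gamma) settings: mild vs. extreme
fit1 <- svm(y ~ ., d, cost = 1,    gamma = 0.5)  # tends to generalize
fit2 <- svm(y ~ ., d, cost = 1000, gamma = 50)   # tends to overfit

# Draw the decision regions for each fit
plot(fit1, d)
plot(fit2, d)
```

The first boundary comes out smooth, while the second wraps tightly around individual points — the same generalization-vs-overfitting contrast as in the figures above.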
In short: if you do machine learning, parameter tuning is a must! So let's take a quick look at how to do it in R, going through some representative packages — partly for my own study as well.
The {randomForest} package
For the principles and advantages of random forests, please just go read id:shakezo's article (laughs). Here I'll focus on parameter tuning with the {randomForest} package, a frequent guest on this blog.
Using the {randomForest} package
To save effort, I'll just reuse the sample data from a previous article as-is, stored in a data frame named sample_d.
As I've touched on many times in various articles, a random forest is easy to fit with the randomForest() function in {randomForest}. For a start, let's just run it with the defaults, without doing anything special.
> sample_d.rf<-randomForest(cv~.,sample_d)
> print(sample_d.rf)

Call:
 randomForest(formula = cv ~ ., data = sample_d)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6.37%
Confusion matrix:
      No  Yes class.error
No  1399  101  0.06733333
Yes   90 1410  0.06000000
> importance(sample_d.rf)
   MeanDecreaseGini
a1        21.554057
a2        11.912213
a3         2.550909
a4       219.898301
a5        82.449264
a6       735.583208
a7         2.543989
Pretty much what it looks like. The OOB estimate of error rate is 6.37%, which already doesn't seem too bad at this point*2.
Optimizing with a grid search via tuneRF()
Now for today's actual topic: the point is that you should tune and optimize the estimation parameters with a grid search, so let's try it. As id:shakezo's article puts it,

The main parameters of RandomForest are the following two:

- the number of decision trees to build
- the number of features used when building each individual decision tree
Of these, the randomForest() function in {randomForest} lets you specify the number of trees as ntree and the number of features used per tree as mtry. As for ntree, you can plot the fitted randomForest.formula-class object with plot() and watch how convergence changes as ntree grows, so that one is judged after the fact.
That leaves mtry as the one thing to focus on here. Grid searching it is simple: just use the tuneRF() function in {randomForest}.
> sample_d.tune<-tuneRF(sample_d[,-8],sample_d[,8],doBest=T)
mtry = 2  OOB error = 6.17%
Searching left ...
mtry = 1 	OOB error = 9%
-0.4594595 0.05
Searching right ...
mtry = 4 	OOB error = 6.63%
-0.07567568 0.05
As you can see, setting doBest to TRUE makes it plot the mtry candidates and pick one for you. Note that tuneRF() also lets you control the number of trees grown during the search (via the ntreeTry argument), but the default should be fine.
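For reference, tuneRF()'s search behavior can also be spelled out explicitly. The sketch below just makes the documented arguments visible, shown with their usual defaults (treat the exact values as illustrative):

```r
# The same call as above, with tuneRF()'s search arguments written out.
library(randomForest)

sample_d.tune2 <- tuneRF(
  sample_d[, -8], sample_d[, 8],
  ntreeTry   = 50,   # trees grown per mtry candidate (not the final ntree)
  stepFactor = 2,    # mtry is multiplied/divided by this at each step
  improve    = 0.05, # keep searching only while OOB error improves by 5%
  doBest     = TRUE  # return the forest refit with the best mtry found
)
```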
So I actually re-ran the fit specifying mtry=2.
> sample_d.rf2<-randomForest(cv~.,sample_d,mtry=2)
> print(sample_d.rf2)

Call:
 randomForest(formula = cv ~ ., data = sample_d, mtry = 2)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6.33%
Confusion matrix:
      No  Yes class.error
No  1400  100  0.06666667
Yes   90 1410  0.06000000
> importance(sample_d.rf2)
   MeanDecreaseGini
a1        21.687817
a2        12.408733
a3         2.508693
a4       196.222452
a5        76.332301
a6       776.395852
a7         2.674710
The OOB estimate of error rate improved, if only slightly, to 6.33%. While we're at it, let's change ntree as well and run it once more.
> sample_d.rf3<-randomForest(cv~.,sample_d,mtry=2,ntree=2000)
> print(sample_d.rf3)

Call:
 randomForest(formula = cv ~ ., data = sample_d, mtry = 2, ntree = 2000)
               Type of random forest: classification
                     Number of trees: 2000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 6.3%
Confusion matrix:
      No  Yes class.error
No  1403   97  0.06466667
Yes   92 1408  0.06133333
> importance(sample_d.rf3)
   MeanDecreaseGini
a1        21.806920
a2        12.199403
a3         2.467463
a4       201.304439
a5        79.126512
a6       767.086724
a7         2.765361
> plot(sample_d.rf3)
The OOB estimate of error rate dropped to exactly 6.3%! (Well, only by a hair.) Plotting the convergence of the learning process shows that it had in fact converged well enough by around ntree = 500, so there was probably no need to push it to 2000. Considering the computational cost, I concluded that ntree = 500 is the reasonable choice.

...So that's what grid-search tuning with tuneRF() looks like. With real data you'll constantly run into cases where the error rate just won't come down without tuning, so this really matters.
The {e1071} package
{e1071} is a general-purpose package and covers a lot of ground, but as a representative example let's use svm(), the R implementation of LIBSVM*3. Fitting the model itself looks like this.
> sample_d.libsvm<-svm(cv~.,sample_d)
> summary(sample_d.libsvm)

Call:
svm(formula = cv ~ ., data = sample_d)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1428571

Number of Support Vectors:  468
 ( 216 252 )

Number of Classes:  2
Levels:
 No Yes
To tune svm() in {e1071}, use the tune.svm() function.
> t<-tune.svm(cv~.,data=sample_d)
> summary(t)

Error estimation of 'svm' using 10-fold cross validation: 0.06468537
> t$best.parameters
  dummyparameter
1              0
> t$best.performance
[1] 0.06468537
> t$best.model

Call:
best.svm(x = cv ~ ., data = sample_d)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1428571

Number of Support Vectors:  468
Since it gets auto-tuned with the defaults right from the start, this isn't very interesting (sweat). It would probably be worth playing around with it more on, say, the iris data.

Incidentally, for this section I referred to id:hoxo_m's article "SVM のチューニングのしかた(2)". If anything, that article is the one to read — quoting its tuning procedure:
As the tuning procedure,

- search roughly and broadly with a grid search;
- then narrow down to the region where the optimal parameters seem to lie and run the grid search again;

that is, perform a two-stage grid search.
It thus walks through a legitimate tuning procedure carried out properly, step by step (using heatmaps to explore for the optimal parameters), so it's a must-read.
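As a rough sketch of that two-stage procedure using tune.svm() from {e1071} — note that the ranges below are my own illustrative choices, not values taken from the referenced article:

```r
# Two-stage grid search with tune.svm(): coarse first, then refined.
library(e1071)

# Stage 1: coarse search over wide, log-spaced ranges
coarse <- tune.svm(cv ~ ., data = sample_d,
                   gamma = 10^(-3:1), cost = 10^(-1:2))
coarse$best.parameters

# Stage 2: finer search around the stage-1 optimum
# (supposing stage 1 landed near gamma = 0.1, cost = 1)
fine <- tune.svm(cv ~ ., data = sample_d,
                 gamma = 0.1 * 2^(-2:2), cost = 2^(-2:2))
fine$best.parameters
fine$best.model
```

The 10-fold cross-validation error for every grid point is in the returned object, so plot(coarse) or a heatmap of its $performances is a handy way to see where to narrow in.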
Incidentally, {e1071} also implements methods such as k-nearest neighbors, neural networks, decision trees*4, and random forests, and every one of them comes with a matching tune.XXX() function, so tuning is easy regardless of which method you use.
The {caret} package
{caret} covers an enormous number of machine learning methods*5, and tuning is basically controlled through two arguments of the central train() function: tuneLength and tuneGrid. For example, with method="svmRadial"*6 it looks like this.
> sample_d.c_svm<-train(cv~.,data=sample_d,method="svmRadial",trace=T,tuneLength=10)
> print(sample_d.c_svm)
3000 samples
   7 predictors
   2 classes: 'No', 'Yes'

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 3000, 3000, 3000, 3000, 3000, 3000, ...

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Accuracy SD  Kappa SD
  0.25  0.935     0.87   0.00551      0.011
  0.5   0.936     0.871  0.00532      0.0107
  1     0.935     0.87   0.00561      0.0112
  2     0.934     0.867  0.00624      0.0125
  4     0.933     0.865  0.00596      0.0119
  8     0.932     0.863  0.00587      0.0118
  16    0.931     0.861  0.00602      0.0121
  32    0.93      0.861  0.00596      0.0119
  64    0.93      0.861  0.00596      0.0119
  128   0.93      0.861  0.00596      0.0119

Tuning parameter 'sigma' was held constant at a value of 0.1179734
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were C = 0.5 and sigma = 0.118.
Because I set tuneLength=10, tuning trials are run over 10 parameter values — the margin parameter C appears to be stepped along powers of 2.

As you can see from this result, though, narrowing the grid search range a bit seems likely to improve performance, so I decided to set the range explicitly with the createGrid() function and pass it to train() as the tuneGrid argument.
> t.grid<-createGrid("svmRadial",data=sample_d,len=4)
> print(t.grid)
   .sigma   .C
1    0.01 0.25
2    0.10 0.25
3    1.00 0.25
4   10.00 0.25
5    0.01 0.50
6    0.10 0.50
7    1.00 0.50
8   10.00 0.50
9    0.01 1.00
10   0.10 1.00
11   1.00 1.00
12  10.00 1.00
13   0.01 2.00
14   0.10 2.00
15   1.00 2.00
16  10.00 2.00
> sample_d.c_svm<-train(cv~.,data=sample_d,method="svmRadial",trace=F,tuneGrid=t.grid)
> print(sample_d.c_svm)
3000 samples
   7 predictors
   2 classes: 'No', 'Yes'

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 3000, 3000, 3000, 3000, 3000, 3000, ...

Resampling results across tuning parameters:

  C     sigma  Accuracy  Kappa  Accuracy SD  Kappa SD
  0.25  0.01   0.935     0.87   0.00546      0.0109
  0.25  0.1    0.936     0.872  0.00574      0.0115
  0.25  1      0.932     0.864  0.00544      0.0109
  0.25  10     0.931     0.862  0.00512      0.0103
  0.5   0.01   0.935     0.87   0.00546      0.0109
  0.5   0.1    0.936     0.872  0.00545      0.0109
  0.5   1      0.932     0.865  0.00431      0.0087
  0.5   10     0.932     0.865  0.00431      0.0087
  1     0.01   0.935     0.87   0.00546      0.0109
  1     0.1    0.937     0.873  0.00516      0.0103
  1     1      0.932     0.865  0.00431      0.0087
  1     10     0.932     0.865  0.00431      0.0087
  2     0.01   0.935     0.87   0.00546      0.0109
  2     0.1    0.935     0.871  0.00454      0.00912
  2     1      0.932     0.865  0.00431      0.0087
  2     10     0.932     0.865  0.00431      0.0087

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were C = 1 and sigma = 0.1.
There seem to be some combinations that get stuck in something like a local optimum, but accuracy did improve ever so slightly. Widening the grid search range too far costs computation time, so keep it within reason.
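One caveat: createGrid() was removed in later versions of {caret}. There, you build the grid yourself with expand.grid() and pass it as tuneGrid, with the columns named after the method's parameters (no leading dot). A minimal sketch, assuming a current caret:

```r
# Equivalent grid to the one above, built by hand for newer caret versions.
library(caret)

t.grid <- expand.grid(sigma = c(0.01, 0.1, 1, 10),
                      C     = c(0.25, 0.5, 1, 2))
sample_d.c_svm <- train(cv ~ ., data = sample_d, method = "svmRadial",
                        trace = FALSE, tuneGrid = t.grid)
```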
As another example, with method="nnet" — i.e. a neural network — it looks like this.
> sample_d.c_nnet<-train(cv~.,data=sample_d,method="nnet",tuneLength=4,maxit=100,trace=F)
> print(sample_d.c_nnet)
3000 samples
   7 predictors
   2 classes: 'No', 'Yes'

No pre-processing
Resampling: Bootstrap (25 reps)

Summary of sample sizes: 3000, 3000, 3000, 3000, 3000, 3000, ...

Resampling results across tuning parameters:

  size  decay    Accuracy  Kappa  Accuracy SD  Kappa SD
  1     0        0.937     0.874  0.00796      0.0159
  1     1e-04    0.937     0.874  0.00717      0.0143
  1     0.00316  0.937     0.874  0.0069       0.0138
  1     0.1      0.936     0.872  0.00703      0.0141
  3     0        0.934     0.869  0.00704      0.0141
  3     1e-04    0.935     0.87   0.00648      0.013
  3     0.00316  0.933     0.866  0.0144       0.0288
  3     0.1      0.936     0.872  0.0066       0.0132
  5     0        0.931     0.863  0.00653      0.0131
  5     1e-04    0.933     0.866  0.00691      0.0138
  5     0.00316  0.932     0.864  0.0067       0.0134
  5     0.1      0.935     0.87   0.00702      0.0141
  7     0        0.928     0.856  0.0217       0.0429
  7     1e-04    0.932     0.863  0.00673      0.0135
  7     0.00316  0.932     0.864  0.00673      0.0135
  7     0.1      0.934     0.868  0.00722      0.0145

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 1 and decay = 0.
With tuneLength=4, the results of trying 4 values per parameter (for nnet, 4 values along each parameter axis) in a grid search are shown. Since nnet can take a long time to converge in some cases, it seems wise to be quite careful about how you choose the grid search range.
Odds and ends
In {e1071} and {caret} especially, the way tuning works varies from method to method, so I'll cover that once I've built up more knowledge. Maybe I'll get to talk about it at some TokyoR eventually.
*1: The Matlab script is actually lying around somewhere on GitHub, not that it matters.

*2: With real data — say 100,000 records in 50+ dimensions — this can turn into a far more dismal number.

*3: The famous multi-language SVM library, needless to say.

*4: Although I can't seem to find the function that actually runs the all-important decision trees.

*5: Or rather, it drags in piles of other packages as dependencies.

*6: Apparently a wrapper around the ksvm() function in {kernlab}.