Come to think of it, I wrote a roundup entry like this three years ago. Much of its content went on, more or less as-is, to become the draft of the book I published two years ago, so it is an entry I have a lot of memories attached to.

That said... over these three years there has been all sorts of progress in the methods of statistics, machine learning, and data mining, and in the business needs surrounding them, so here and there the old content now looks rather dated. So I decided to write an updated version of that article reflecting three years of progress. Last time it was a "top 10"; this time I have expanded it to "10+2". The lineup is as follows:
- Statistical tests (t-test, chi-squared test, ANOVA, etc.)
- Multiple regression analysis (linear regression models)
- Generalized linear models (GLM: logistic regression, Poisson regression, etc.)
- Random forest
- Xgboost (gradient boosted trees)
- Deep Learning
- Bayesian modeling with MCMC
- word2vec
- K-means clustering
- Graph theory / network analysis
- Other useful methods
- Points worth checking when studying these statistical / machine learning methods
- Closing remarks
- Postscript
The lineup has been reshuffled quite a bit since last time, which I take as a sign of just how much the range of data analysis methods used in real-world practice has broadened (laughs). Also, for reasons of available packages and libraries, this time I include Python tools as well as R ones*1, although the walkthroughs below are basically run in R.

As for the two extra methods, on the other hand: they are widely used in the data analysis industry, but I do not practice them regularly myself, so for those alone I mostly limit the introduction to pointing at other references. With that said, let's take a quick tour.
Disclaimer
- This is again basically an article meant to let you skim the whole landscape in a single post, so in various small places it lacks rigor, falls short on explanation, or omits pointers to further material; please bear with me on that. It is also not meant to provide the knowledge needed to implement anything from scratch, so no offense intended.
- Since the focus of this article is on introducing each data analysis method, I have almost entirely omitted details such as how to install the individual packages and libraries, or the compiler environments needed to build them. When installing, please consult the linked articles or just google as appropriate.
- That said, if any explanation is clearly wrong theoretically, I will fix it immediately, so please do let me (TJO) know, whether in the comments section or on social media.
Statistical tests (t-test, chi-squared test, ANOVA, etc.)

Surprisingly many workplaces retain a strong attachment to the utterly classical, frequentist "statistical test". In short, it is the methodology you use when you want to compare one thing against another, A/B testing being the prime example, and want to say clearly whether the comparison is statistically meaningful. As Prof. [twitter:@KuboBook] says, rather than relying on tests alone you are really better off shifting to statistical modeling, which is far more expressive; still, even now tests are heavily used in business settings to support all kinds of decision making. Here are three examples.
t-test

Basically, you use this when you want to compare whether two means differ. Let's try it on a sample dataset used in my book. The scenario: compare the latency of a particular query between two kinds of database engines and determine which one is faster.
> d<-read.csv('https://raw.githubusercontent.com/ozt-ca/tjo.hatenablog.samples/master/r_samples/public_lib/DM_sampledata/ch3_2_2.txt',header=T,sep=' ')
> head(d)
        DB1      DB2
1 0.9477293 2.465692
2 1.4046824 2.132022
3 1.4064391 2.599804
4 1.8396669 2.366184
5 1.3265343 1.804903
6 2.3114898 2.449027
> boxplot(d) # draw a box plot (see the figure)
> t.test(d$DB1,d$DB2) # t-tests are run with the t.test function

	Welch Two Sample t-test

data:  d$DB1 and d$DB2
t = -3.9165, df = 22.914, p-value = 0.0006957
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.0998402 -0.3394647
sample estimates:
mean of x mean of y 
 1.575080  2.294733 

# Welch's test, which does not assume equal variances, is applied automatically
With p < 0.05, it seems fair to conclude that DB1 is (statistically significantly) faster*2.
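For readers working in Python, the same Welch-style two-sample test is available in SciPy. A minimal sketch, using synthetic latencies in place of the sample dataset above (the means and spreads here are made-up values for illustration):

```python
# Welch's t-test in Python with SciPy, on synthetic data mimicking the
# DB1-vs-DB2 latency comparison (not the actual sample dataset).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
db1 = rng.normal(loc=1.6, scale=0.5, size=30)  # hypothetical latencies for DB1
db2 = rng.normal(loc=2.3, scale=0.5, size=30)  # hypothetical latencies for DB2

# equal_var=False gives Welch's t-test, matching R's t.test default
t_stat, p_value = stats.ttest_ind(db1, db2, equal_var=False)
print(t_stat, p_value)
```

With a real difference in means this large relative to the noise, the p-value comes out far below 0.05, just as in the R run above.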
Chi-squared test

A typical pattern in business: you want to compare whether some rate differs with and without a given intervention. This is the statistical test for that. For example, suppose that before and after improving the user flow of a smartphone app, the number of conversions changed as follows.
|  | Converted | Did not convert |
|---|---|---|
| Before | 25 | 117 |
| After | 16 | 32 |
When you want to compare rates between already tabulated data like this, you cannot use a method like the t-test that relies on the spread of raw data (or of data whose means and standard deviations are known). The standard tool instead is the chi-squared test (test of independence), which asks whether the underlying data follow the same distribution. In R it runs as follows.
> d<-matrix(c(25,117,16,32),ncol=2,byrow=T)
> chisq.test(d) # chi-squared test with the chisq.test function

	Pearson's Chi-squared test with Yates' continuity correction

data:  d
X-squared = 4.3556, df = 1, p-value = 0.03689
Again p < 0.05, so it seems fair to conclude that the flow improvement did have an effect on conversions. By the way, if you want to combine the results of chi-squared tests run separately on several datasets that can be assumed to share the same intervention, you use the meta-analysis techniques introduced in the article below.

And not just this one: many other statistical tests can likewise have multiple results combined via the methodology of meta-analysis, so it is worth knowing about.
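Returning to the 2×2 table above: the same test can be reproduced in Python with SciPy. A small sketch (my addition, assuming SciPy is installed):

```python
# The same 2x2 chi-squared (independence) test in Python with SciPy.
# chi2_contingency applies Yates' continuity correction to 2x2 tables
# by default, so it reproduces R's chisq.test result above.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[25, 117],
                  [16, 32]])  # before / after the flow improvement
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # roughly X-squared = 4.36, p = 0.037
```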
ANOVA (analysis of variance)

Many patterns are possible, but basically this is the method to use when you want to compare whether there are differences among three or more groups when two or more interventions are applied across them. Strictly speaking it is not quite the same thing, but as an idea it is roughly the t-test extended to three or more groups. In terms of workflow, though, it is nearly identical to the multiple regression (normal linear model) that appears later, so particularly when you want to know whether an intervention's effect is positive or negative, it is often better to switch to multiple regression instead.

For example, suppose a face-to-face sales corner handling products of two categories runs two different promotions over four days, and you want to know whether the style of promotion changed the number of units sold*3. Put the presence of each promotion in the variable pr, the product category in the categorical variable category, and the unit sales in the variable cnt, and you can compute the ANOVA roughly as follows.
> d<-data.frame(cnt=c(210,435,130,720,320,470,250,380,290,505,180,320,310,390,410,510),pr=c(rep(c('F','Y'),8)),category=rep(c('a','a','b','b'),4))
> d.aov<-aov(cnt~.^2,d) # ANOVA is computed with the aov function
> summary(d.aov)
            Df Sum Sq Mean Sq F value  Pr(>F)   
pr           1 166056  166056  12.984 0.00362 **
category     1     56      56   0.004 0.94822   
pr:category  1   5256    5256   0.411 0.53353   
Residuals   12 153475   12790                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So it seems fair to conclude that different promotions drive different unit sales. On the other hand, there is no difference between product categories, and no sign that the effect of the promotion changes when the product category changes (since the interaction is not significant).
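SciPy does not cover the two-way-with-interaction layout used by aov above (that would need something like statsmodels), but the one-way idea can be sketched in Python on the same sales counts, split by promotion only:

```python
# One-way ANOVA sketch in Python: do the two promotions ('F' vs 'Y')
# differ in mean sales count? (One-way only; the two-way analysis with
# interaction shown above is beyond scipy.stats.f_oneway.)
from scipy.stats import f_oneway

# the sales counts from the example, split by promotion
f_group = [210, 130, 320, 250, 290, 180, 310, 410]
y_group = [435, 720, 470, 380, 505, 320, 390, 510]
f_stat, p_value = f_oneway(f_group, y_group)
print(f_stat, p_value)  # F is large and p is well below 0.05
```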
Other tests

There are others, such as the F-test and the sign test. Going further, there is also the distinction between parametric tests (which place assumptions on the shape of the data distribution) and non-parametric tests (which largely do not), but I omit all of that here. My apologies.
Multiple regression analysis (linear regression models)

This is about as basic as it gets, and yet, especially in business-side practice, it is a prime example of a method that surprisingly still has not spread widely*4. As a worked example, let's try modeling the beer revenue of a certain region, an exercise also used in my book.

Here we model the target variable Revenue (beer revenue in that region) by multiple regression on the explanatory variables CM (TV commercial airing volume), Temp (temperature), and Firework (whether a fireworks festival was held in the region).
> d<-read.csv('https://raw.githubusercontent.com/ozt-ca/tjo.hatenablog.samples/master/r_samples/public_lib/DM_sampledata/ch4_3_2.txt',header=T,sep=' ')
> head(d)
   Revenue  CM Temp Firework
1 47.14347 141   31        2
2 36.92363 144   23        1
3 38.92102 155   32        0
4 40.46434 130   28        0
5 51.60783 161   37        0
6 32.87875 154   27        0
> d.lm<-lm(Revenue~.,d) # linear regression models use the lm function
> summary(d.lm)

Call:
lm(formula = Revenue ~ ., data = d)

Residuals:
   Min     1Q Median     3Q    Max 
-6.028 -3.038 -0.009  2.097  8.141 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17.23377   12.40527   1.389  0.17655    
CM          -0.04284    0.07768  -0.551  0.58602    
Temp         0.98716    0.17945   5.501    9e-06 ***
Firework     3.18159    0.95993   3.314  0.00271 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.981 on 26 degrees of freedom
Multiple R-squared:  0.6264,	Adjusted R-squared:  0.5833 
F-statistic: 14.53 on 3 and 26 DF,  p-value: 9.342e-06

# plot the fit
> matplot(cbind(d$Revenue,predict(d.lm,newdata=d[,-1])),type='l',lwd=c(2,3),lty=1,col=c(1,2))
> legend('topleft',legend=c('Data','Predicted'),lwd=c(2,3),lty=1,col=c(1,2),ncol=1)
We find that temperature and whether a fireworks festival is held matter (and TV commercials hardly do). If future values of the explanatory variables are available (TV commercials have planned values, and temperature can be obtained via weather forecasts and the like), the predict method can also forecast future revenue. By the way, as I once wrote in a slide deck published on SlideShare, there is a diagram of the thinking behind the whole linear-model family, multiple regression included.

The point is that the fundamental idea is to find, via an optimization program, a model in which a target variable is expressed as a linear combination of explanatory variables. This concept is shared by a great deal of other statistical modeling and machine learning, so it is well worth remembering.

(Incidentally, linear regression is also often treated as the most basic of basics in the machine learning field, and many texts, starting with the yellow book PRML, use it as the subject for implementing an algorithm from scratch. In general you can just crunch through the matrix equations, but deliberately writing code in Python or something that estimates the parameters by steepest descent and observing how it behaves also makes for good practice.)
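In that spirit, here is a minimal sketch of the exercise suggested in the parenthetical above: estimate linear-regression parameters by plain gradient descent on synthetic data, and check the result against the closed-form (normal equation) solution. All names and values here are made up for illustration.

```python
# Linear regression fitted two ways: the normal equation, and batch
# gradient descent on the squared error, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.normal(size=(200, 2))]  # intercept + 2 features
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# closed form: w = (X'X)^{-1} X'y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# steepest descent on the mean squared error
w = np.zeros(3)
lr = 0.1
for _ in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE
    w -= lr * grad

print(w_closed, w)  # the two estimates agree closely
```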
Generalized linear models (GLM: logistic regression, Poisson regression, etc.)

From here we enter the boundary zone between statistics and machine learning, which is also where the real fun of statistical modeling lies. The basic idea is the same as multiple regression (the linear model); the one difference is that whereas the linear model assumes the target variable follows a normal distribution, in a generalized linear model you must change how you set up the model depending on which distribution the target variable follows. The models are often named after that distribution, e.g. "Poisson regression" or "negative binomial regression".
Logistic regression

Because it performs binary classification, this one is treated as an important first step in the machine learning field as well. In distribution terms it is a generalized linear model whose response follows the binomial distribution. I have touched on it briefly in a previous article.

Here we use the data from chapter 6 of my book. The scenario: an e-commerce site has prepared six promotion pages, d21 through d26, and we want to see how visiting (or not visiting) each page contributed to conversion.
> d<-read.csv('https://raw.githubusercontent.com/ozt-ca/tjo.hatenablog.samples/master/r_samples/public_lib/DM_sampledata/ch6_4_2.txt',header=T,sep=' ')
> d$cv<-as.factor(d$cv) # make the target variable a factor
> d.glm<-glm(cv~.,d,family=binomial) # for a GLM, specify the distribution via the family argument of glm
> summary(d.glm)

Call:
glm(formula = cv ~ ., family = binomial, data = d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3793  -0.3138  -0.2614   0.4173   2.4641  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.0120     0.9950  -1.017   0.3091  
d21           2.0566     0.8678   2.370   0.0178 *
d22          -1.7610     0.7464  -2.359   0.0183 *
d23          -0.2136     0.6131  -0.348   0.7276  
d24           0.2994     0.8368   0.358   0.7205  
d25          -0.3726     0.6064  -0.614   0.5390  
d26           1.4258     0.6408   2.225   0.0261 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 173.279  on 124  degrees of freedom
Residual deviance:  77.167  on 118  degrees of freedom
AIC: 91.167

Number of Fisher Scoring iterations: 5
The d21 page looks best, and conversely d22 looks like one to avoid. As with the linear regression model, feed this model the explanatory variables of unseen data via the predict method and it will predict the target variable.
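Under the hood this is just a sigmoid on a linear predictor. A minimal NumPy sketch on synthetic visit/CV-style binary data (R's glm uses Fisher scoring; this sketch uses plain gradient ascent on the log-likelihood, and the coefficients here are made-up values):

```python
# Minimal logistic regression via gradient ascent, recovering known
# coefficients from synthetic binary "page visit -> conversion" data.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = np.c_[np.ones(n), rng.integers(0, 2, size=(n, 2))]  # intercept + 2 binary visit flags
true_w = np.array([-1.0, 2.0, -1.5])
p = 1 / (1 + np.exp(-X @ true_w))
y = rng.random(n) < p  # 1 = converted

w = np.zeros(3)
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - p_hat) / n  # gradient of the mean log-likelihood

print(w)  # close to true_w: one page helps CV, the other hurts it
```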
Poisson regression

This one, by contrast, belongs more to statistics proper: a generalized linear model used when the target variable follows a Poisson distribution. Googling "Poisson distribution" brings up all sorts of explanations, but basically it is fine to think of it as the probability distribution of the count of events that occur only rarely within some population*5. A distribution shaped like this, for example, is quite likely to be Poisson.

In business practice, quantities like "the number of converting users out of the total daily site visitors" are classic examples of Poisson-distributed data (provided the denominator is sufficiently large)*6. Incidentally, in a previous article I wrote various reflections on generalized linear models in general, Poisson regression included, so have a look if you are interested.
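The "rare events out of a large population" intuition can be checked numerically: a binomial with large n and small p is close to Poisson(np), and a Poisson variable has variance roughly equal to its mean. A toy sketch with made-up visitor numbers:

```python
# Binomial(n, p) with large n, small p approximates Poisson(n*p),
# and for Poisson data the variance is close to the mean.
import numpy as np

rng = np.random.default_rng(0)
n_visitors, cv_rate = 100_000, 0.0003          # hypothetical daily visitors and CV rate
daily_cv = rng.binomial(n_visitors, cv_rate, size=10_000)
pois = rng.poisson(n_visitors * cv_rate, size=10_000)

print(daily_cv.mean(), pois.mean())  # both close to 30
print(daily_cv.var(), pois.var())    # variances also close to 30
```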
Now, for the worked example: I have no handy dataset for this one, so we will simply use the example from R's help as-is. It appears to be data from some epidemiological study reported in Dobson's 1990 book "An Introduction to Generalized Linear Models", but I have not read the book, so I do not know the details. Sorry about that...
> ## Dobson (1990) Page 93: Randomized Controlled Trial :
> counts <- c(18,17,15,20,10,20,25,13,12)
> outcome <- gl(3,1,9)
> treatment <- gl(3,3)
> print(d.AD <- data.frame(treatment, outcome, counts))
  treatment outcome counts
1         1       1     18
2         1       2     17
3         1       3     15
4         2       1     20
5         2       2     10
6         2       3     20
7         3       1     25
8         3       2     13
9         3       3     12
> glm.D93 <- glm(counts ~ outcome + treatment, family = poisson) # specifying poisson in the family argument gives Poisson regression
> summary(glm.D93)

Call:
glm(formula = counts ~ outcome + treatment, family = poisson)

Deviance Residuals: 
       1         2         3         4         5         6         7         8         9  
-0.67125   0.96272  -0.16965  -0.21999  -0.95552   1.04939   0.84715  -0.09167  -0.96656  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.045e+00  1.709e-01  17.815   <2e-16 ***
outcome2    -4.543e-01  2.022e-01  -2.247   0.0246 *  
outcome3    -2.930e-01  1.927e-01  -1.520   0.1285    
treatment2   1.338e-15  2.000e-01   0.000   1.0000    
treatment3   1.421e-15  2.000e-01   0.000   1.0000    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 10.5814  on 8  degrees of freedom
Residual deviance:  5.1291  on 4  degrees of freedom
AIC: 56.761

Number of Fisher Scoring iterations: 4

# check the model's generalization via AIC, and check for overdispersion
# via the ratio of the residual deviance to its degrees of freedom
> hist(counts,breaks=50) # draw a histogram
So the conclusion is that outcome2 matters. Note that, as the past article above also mentions, Poisson regression does not fit well when the target variable contains a great many zeros; in that case you need a GLM based on the negative binomial distribution. In R you can compute it with the glm.nb function in the {MASS} package.
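The diagnostic behind that choice, overdispersion, is easy to see numerically: Poisson data has variance roughly equal to its mean, while negative-binomial data has variance clearly larger. A toy sketch with arbitrary parameters:

```python
# Overdispersion check: Poisson has var ~ mean; negative binomial,
# parameterized to have the same mean of 5, has a much larger variance,
# which is the situation where glm.nb-style models come in.
import numpy as np

rng = np.random.default_rng(0)
pois = rng.poisson(5, size=20_000)
negb = rng.negative_binomial(n=2, p=2 / 7, size=20_000)  # mean n(1-p)/p = 5, var = 17.5

print(pois.mean(), pois.var())   # ~5, ~5
print(negb.mean(), negb.var())   # ~5, variance clearly exceeds the mean
```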
Regularization (L1 / L2 norm)

From around here the machine learning flavor gradually increases. The first step I want to cover is "regularization".

The degree to which a model fits not only the past data it was trained on but also, to a reasonable extent, unknown data is called its generalization performance (ability). The key technique for improving generalization performance is regularization. For details, please read this past article. Put plainly, it is the trick of imposing a constraint when estimating parameters so that the model does not fit the noise in past data any more than necessary.

What matters especially in machine learning is to always keep the training data and the test data (used to evaluate the model's performance) separate, i.e. cross validation. Without it, models that merely fit the noise in the training data end up looking good in terms of apparent performance, so be thorough about this*7.
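Why the held-out data matters can be shown in a few lines: an overly flexible model fit to a small noisy sample looks great on its own training points and much worse on fresh data from the same curve. A toy sketch (my addition, unrelated to the tennis dataset below):

```python
# A degree-9 polynomial fit to 20 noisy points: tiny training error,
# much larger error on new data drawn from the same curve.
import numpy as np

rng = np.random.default_rng(0)
def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(scale=0.3, size=n)

x_tr, y_tr = make_data(20)
x_te, y_te = make_data(200)

coef = np.polyfit(x_tr, y_tr, deg=9)          # overly flexible model
mse_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
mse_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
print(mse_tr, mse_te)  # training error is the smaller of the two
```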
With that in mind, let's use the dataset of men's and women's Grand Slam tennis matches that I have used as an example many times on this blog. The cross-validation setup: train on the men's data and predict the match outcomes in the women's data. As the example, we will practice L1 regularization (Lasso regression), the kind that prunes away unnecessary explanatory variables; L2 regularization (Ridge regression), the kind that shrinks everything smoothly, is omitted here for reasons of space.
> dm<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/tennis/men.txt',header=T,sep='\t')
> dw<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/exp_uci_datasets/tennis/women.txt',header=T,sep='\t')
> dm<-dm[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]
> dw<-dw[,-c(1,2,16,17,18,19,20,21,34,35,36,37,38,39)]

# L1 regularization
> library(glmnet)
> dm.cv.glmnet<-cv.glmnet(as.matrix(dm[,-1]),as.matrix(dm[,1]),family="binomial",alpha=1)
# alpha=1 gives L1 regularization, alpha=0 gives L2, and anything in between is the elastic net
# cv.glmnet also searches for the optimal regularization parameter by cross validation
> plot(dm.cv.glmnet)
> coef(dm.cv.glmnet,s=dm.cv.glmnet$lambda.min) # pass the optimal parameter found to the s argument
25 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  3.533402e-01
FSP.1        3.805604e-02
FSW.1        1.179697e-01
SSP.1       -3.275595e-05
SSW.1        1.475791e-01
ACE.1        .           
DBF.1       -8.934231e-02
WNR.1        3.628403e-02
UFE.1       -7.839983e-03
BPC.1        3.758665e-01
BPW.1        2.064167e-01
NPA.1        .           
NPW.1        .           
FSP.2       -2.924528e-02
FSW.2       -1.568441e-01
SSP.2        .           
SSW.2       -1.324209e-01
ACE.2        1.233763e-02
DBF.2        4.032510e-02
WNR.2       -2.071361e-02
UFE.2       -6.114823e-06
BPC.2       -3.648171e-01
BPW.2       -1.985184e-01
NPA.2        .           
NPW.2        1.340329e-02
> table(dw$Result,round(predict(dm.cv.glmnet,as.matrix(dw[,-1]),s=dm.cv.glmnet$lambda.min,type='response'),0))
   
      0   1
  0 215  12
  1  18 207
> sum(diag(table(dw$Result,round(predict(dm.cv.glmnet,as.matrix(dw[,-1]),s=dm.cv.glmnet$lambda.min,type='response'),0))))/nrow(dw)
[1] 0.9336283
# accuracy 93.4%

# comparison: plain logistic regression
> dm.glm<-glm(Result~.,dm,family=binomial)
> table(dw$Result,round(predict(dm.glm,newdata=dw[,-1],type='response')))
   
      0   1
  0 211  16
  1  17 208
> sum(diag(table(dw$Result,round(predict(dm.glm,newdata=dw[,-1],type='response')))))/nrow(dw)
[1] 0.9269912
# accuracy 92.7%
Indeed, the L1-regularized model that pruned the unnecessary parameters beats plain logistic regression on test-data prediction accuracy. This, you could say, is what "improving generalization performance" means.
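The variable pruning that glmnet performs rests on coordinate descent with a soft-thresholding step, which is short enough to sketch directly. On synthetic data where only two of ten features matter, the L1 penalty drives the irrelevant coefficients to exactly zero (all names and values here are made up):

```python
# Tiny Lasso via coordinate descent: soft-thresholding zeroes out the
# coefficients of irrelevant features (glmnet does this at scale).
import numpy as np

def soft(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

rng = np.random.default_rng(0)
n, p, lam = 200, 10, 0.2
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

w = np.zeros(p)
for _ in range(100):                      # coordinate descent sweeps
    for j in range(p):
        r = y - X @ w + X[:, j] * w[j]    # residual excluding feature j
        rho = X[:, j] @ r / n
        w[j] = soft(rho, lam) / (X[:, j] @ X[:, j] / n)

print(np.round(w, 2))  # irrelevant coefficients end up exactly zero
```

Note the side effect visible in the output: the two useful coefficients are also shrunk somewhat toward zero, the usual price of the L1 penalty.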
Other GLMs

Basically, if you know the options that appear in the family argument of R's glm function, you should have no problems in practice. That said, there are occasional borderline cases where you hesitate over subtleties, e.g. whether to use quasi-poisson or glm.nb; when that happens, studying each individual case as it arises is quite good enough.
Random forest

Now we finally get to full-fledged machine learning. First up is random forest, which is heavily used in quite a variety of machine-learning-based production environments. As today's flagship of bagging-style ensemble learning, it has library implementations in a remarkable number of languages and is used all over the world. As for how it works, I have written an article on it in the past, so please read that if you are interested.

So, using the MNIST handwritten digit recognition dataset, a staple among machine learning examples, let's look at random forest's performance. The original data is very heavy, so instead I have placed a short version, downsampled to 5,000 training rows and 1,000 test rows, in my GitHub repository; we will use that.
> train<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_train.csv')
> test<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_test.csv')
> train$label<-as.factor(train$label)
> test$label<-as.factor(test$label)
> library(randomForest)
> train.rf<-randomForest(label~.,train)
> table(test$label,predict(train.rf,newdata=test[,-1]))
   
     0  1  2  3  4  5  6  7  8  9
  0 96  0  0  0  0  0  3  0  1  0
  1  0 99  0  0  0  0  0  0  1  0
  2  0  0 96  1  1  0  1  1  0  0
  3  0  0  2 87  0  4  1  1  3  2
  4  0  0  0  0 96  0  1  0  0  3
  5  1  2  0  1  0 94  2  0  0  0
  6  0  0  1  0  1  2 95  0  1  0
  7  0  2  0  0  1  0  0 93  0  4
  8  0  0  1  0  0  0  0  0 99  0
  9  0  0  0  0  2  1  0  1  0 96
> sum(diag(table(test$label,predict(train.rf,newdata=test[,-1]))))/nrow(test)
[1] 0.951
# accuracy 95.1%
Random forest achieved an accuracy of 95.1%. Incidentally, the Kaggle MNIST competition provides, as a benchmark, example code using exactly the same R {randomForest} package.

Oh, and those interested might also enjoy drawing the MNIST handwritten digits. It is a bit of a hassle in R, but it can be done like this.
> par(mfrow=c(3,4))
> for(i in 1:10){
+ image(t(apply(matrix(as.vector(as.matrix(train[(i-1)*500+50,-1])),ncol=28,nrow=28,byrow=T),2,rev)),col=grey(seq(0,1,length.out=256)))
+ }
The handwriting is pretty sloppy (laughs), and some digits are so bad that even the human eye cannot tell what they are (you only figure it out from the attached ground-truth label). "Let's classify them with machine learning anyway!" is, I gather, the whole fun of this MNIST competition.
Xgboost (gradient boosted trees)

Meanwhile, what has suddenly been gathering attention in recent competitions such as Kaggle and the KDD Cup is xgboost. It is a library providing a faster implementation of the long-known gradient boosted trees. Because it delivers such overwhelming performance in competition after competition, it is spreading around the world at a furious pace. As an aside, thanks to this, the xgboost article on my English-language blog comes up second or third from the top when you google 'xgboost' in the English-speaking world (laughs). As for how it works, I once explained it in a past article, so have a look at that if you like.

So, as with random forest, let's try it on the short version of MNIST. Note that the results below are what came out after a corresponding amount of tuning, so please bear that in mind (laughs).
> train<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_train.csv')
> test<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_test.csv')
> library(xgboost)
> library(Matrix) # needed for data preprocessing
> train.mx<-sparse.model.matrix(label~., train)
> test.mx<-sparse.model.matrix(label~., test)
# convert the datasets into the format xgboost handles
> dtrain<-xgb.DMatrix(train.mx, label=train$label)
> dtest<-xgb.DMatrix(test.mx, label=test$label)
# there are many parameters; see e.g. the xgboost GitHub for details
> train.gbdt<-xgb.train(params=list(objective="multi:softmax", num_class=10, eval_metric="mlogloss", eta=0.3, max_depth=5, subsample=1, colsample_bytree=0.5), data=dtrain, nrounds=70, watchlist=list(train=dtrain,test=dtest))
[0]	train-mlogloss:1.439942	test-mlogloss:1.488160
[1]	train-mlogloss:1.083675	test-mlogloss:1.177975
[2]	train-mlogloss:0.854107	test-mlogloss:0.977648
# ... omitted ...
[67]	train-mlogloss:0.004172	test-mlogloss:0.176068
[68]	train-mlogloss:0.004088	test-mlogloss:0.176044
[69]	train-mlogloss:0.004010	test-mlogloss:0.176004
> table(test$label,predict(train.gbdt,newdata=dtest))
   
     0  1  2  3  4  5  6  7  8  9
  0 95  0  0  1  0  0  3  0  1  0
  1  0 99  0  0  0  0  0  1  0  0
  2  0  0 96  2  0  0  1  1  0  0
  3  0  0  1 93  0  0  0  1  2  3
  4  0  0  1  1 95  0  1  0  0  2
  5  0  1  0  1  0 98  0  0  0  0
  6  0  0  1  0  1  2 95  0  1  0
  7  0  0  0  0  1  0  0 96  0  3
  8  0  4  1  0  1  0  0  0 93  1
  9  0  0  0  0  4  1  0  2  0 93
> sum(diag(table(test$label,predict(train.gbdt,newdata=dtest))))/nrow(test)
[1] 0.953
# accuracy 95.3%
At 95.3%, xgboost managed to beat random forest. Note, however, that it depends considerably more on parameter tuning than random forest does, so there are plenty of cases where the accuracy falls short of expectations. There is also some dependence on the random seed*8.
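The core idea behind gradient boosted trees can be sketched with depth-1 "stumps" on a toy 1-D regression problem: each round fits a small tree to the current residuals and adds it with a learning rate. xgboost does this at scale, with regularization and a lot of clever engineering on top; everything below is a made-up illustration:

```python
# Gradient boosting with decision stumps on noisy sin(6x) data:
# each round fits the residuals and is added with a learning rate.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 300)
y = np.sin(6 * x) + rng.normal(scale=0.2, size=300)

def fit_stump(x, r):
    best = None
    for s in np.linspace(0.05, 0.95, 19):      # candidate split points
        left, right = r[x < s].mean(), r[x >= s].mean()
        sse = ((r - np.where(x < s, left, right)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left, right)
    return best[1:]

pred = np.zeros_like(y)
lr = 0.3
for _ in range(100):                           # boosting rounds
    s, left, right = fit_stump(x, y - pred)    # fit the current residuals
    pred += lr * np.where(x < s, left, right)

mse = np.mean((y - pred) ** 2)
print(mse)  # far below the raw variance of y
```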
Deep Learning
Deep Learning is by now becoming a byword for "artificial intelligence". What started out as just the Deep Neural Network (DNN) keeps spawning derivatives: the Convolutional Neural Network (CNN), geared to image recognition, the Recurrent Neural Network (RNN), strong on sequential data such as text and speech, and more.

Accordingly, compared with the early implementation libraries Theano and PyLearn2, an extremely diverse set of implementations is now on offer: Torch, Caffe, Chainer, and then TensorFlow and CNTK. Thanks to that, anyone who can set up a C++ or Python environment (especially one with a GPU) can casually adopt and practice Deep Learning. Explanations of these implementations, and of the theory too, turn up by the truckload if you google, so I omit them in this article. There is also the short talk on Deep Learning that I gave at Japan.R 2014, but it has been left well behind by current trends, so treat it as reference material only.

And since the aim this time is not to try out full-scale implementations, I will only present a simple DNN example using H2O's R package {h2o}, currently the easiest to run from R. Let's do it on the short version of MNIST again.
(* It seems the H2O implementation as a whole changed considerably in a recent update, and the code in my earlier article no longer runs. The code below works as of March 2016, so please try this version first.)
> train<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_train.csv')
> test<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_test.csv')
> train$label<-as.factor(train$label)
> test$label<-as.factor(test$label)
> library(h2o)
# start a Java VM instance
> localH2O <- h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, nthreads=3)
# in the current version, R objects are loaded directly with the as.h2o function
> trData<-as.h2o(train)
> tsData<-as.h2o(test)
# a pile of optimizer parameters follows; see a Deep Learning textbook for details
> res.dl <- h2o.deeplearning(x = 2:785, y = 1, training_frame = trData, activation = "RectifierWithDropout",hidden=c(1024,1024,2048),epochs = 300, adaptive_rate = FALSE, rate=0.01, rate_annealing = 1.0e-6,rate_decay = 1.0, momentum_start = 0.5,momentum_ramp = 5000*18, momentum_stable = 0.99, input_dropout_ratio = 0.2,l1 = 1.0e-5,l2 = 0.0,max_w2 = 15.0, initial_weight_distribution = "Normal",initial_weight_scale = 0.01,nesterov_accelerated_gradient = T, loss = "CrossEntropy", fast_mode = T, diagnostics = T, ignore_const_cols = T,force_load_balance = T)
> pred<-h2o.predict(res.dl,tsData[,-1])
> pred.df<-as.data.frame(pred)
> table(test$label,pred.df[,1])
   
      0   1   2   3   4   5   6   7   8   9
  0  96   0   1   0   0   0   2   1   0   0
  1   0 100   0   0   0   0   0   0   0   0
  2   0   0  97   0   2   0   0   1   0   0
  3   0   0   1  93   0   4   0   1   0   1
  4   0   2   1   0  93   0   0   1   1   2
  5   0   0   0   1   0  99   0   0   0   0
  6   1   0   0   0   0   2  97   0   0   0
  7   0   0   0   0   1   0   0  96   0   3
  8   0   0   1   1   1   2   0   0  95   0
  9   0   0   0   0   2   0   0   2   0  96
> sum(diag(table(test$label,pred.df[,1])))/nrow(test)
[1] 0.962
# accuracy 96.2%
True to the name Deep Learning, or at least to the DNN's honor, it posted a high score of 96.2%*9. This DNN, by the way, is a five-layer network with hidden layers of 1024, 1024, and 2048 units, the 'Rectifier' activation function*10, and the dropout ratio fixed at 0.5, where the regularization effect is largest.
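What such libraries automate is, at bottom, just forward passes and backpropagation. A minimal NumPy sketch of a tiny two-layer network trained on XOR, the classic non-linearly-separable toy problem (architecture and learning rate here are arbitrary choices of mine, far from the dropout/ReLU setup above):

```python
# A tiny sigmoid network trained on XOR by plain backpropagation.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([[0], [1], [1], [0]], float)   # XOR truth table

def sig(z):
    return 1 / (1 + np.exp(-z))

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer, 8 units
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer

losses = []
for _ in range(5000):
    h = sig(X @ W1 + b1)                  # forward pass
    out = sig(h @ W2 + b2)
    losses.append(np.mean((out - y) ** 2))
    d_out = (out - y) * out * (1 - out)   # backward pass (chain rule)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out; b2 -= d_out.sum(0)
    W1 -= X.T @ d_h;   b1 -= d_h.sum(0)

print(losses[0], losses[-1], out.ravel().round(2))
```

With this seed the loss drops steadily, and the outputs typically approach the XOR pattern; deep-learning frameworks do exactly this, only with many more layers and much better optimizers.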
Postscript: an example of a Convolutional Neural Network using MXnet's R package {mxnet}

This uses the approach covered in a later article; see that article for details.
# Installation
> install.packages("drat", repos="https://cran.rstudio.com")
> drat:::addRepo("dmlc")
> install.packages("mxnet")
> library(mxnet)
# Data preparation
> train<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_train.csv')
> test<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/jp/mnist_reproduced/short_prac_test.csv')
> train<-data.matrix(train)
> test<-data.matrix(test)
> train.x<-train[,-1]
> train.y<-train[,1]
> train.x<-t(train.x/255)
> test_org<-test
> test<-test[,-1]
> test<-t(test/255)
> devices <- mx.cpu()
> mx.set.seed(0)
> data <- mx.symbol.Variable("data")
> # first conv
> conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
> tanh1 <- mx.symbol.Activation(data=conv1, act_type="relu")
> pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max",
+                            kernel=c(2,2), stride=c(2,2))
> drop1 <- mx.symbol.Dropout(data=pool1,p=0.5)
> # second conv
> conv2 <- mx.symbol.Convolution(data=drop1, kernel=c(5,5), num_filter=50)
> tanh2 <- mx.symbol.Activation(data=conv2, act_type="relu")
> pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max",
+                            kernel=c(2,2), stride=c(2,2))
> drop2 <- mx.symbol.Dropout(data=pool2,p=0.5)
> # first fullc
> flatten <- mx.symbol.Flatten(data=drop2)
> fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
> tanh4 <- mx.symbol.Activation(data=fc1, act_type="relu")
> drop4 <- mx.symbol.Dropout(data=tanh4,p=0.5)
> # second fullc
> fc2 <- mx.symbol.FullyConnected(data=drop4, num_hidden=10)
> # loss
> lenet <- mx.symbol.SoftmaxOutput(data=fc2)
> train.array <- train.x
> dim(train.array) <- c(28, 28, 1, ncol(train.x))
> test.array <- test
> dim(test.array) <- c(28, 28, 1, ncol(test))
> mx.set.seed(0)
> tic <- proc.time()
> model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
+                                      ctx=devices, num.round=60, array.batch.size=100,
+                                      learning.rate=0.05, momentum=0.9, wd=0.00001,
+                                      eval.metric=mx.metric.accuracy,
+                                      epoch.end.callback=mx.callback.log.train.metric(100))
Start training with 1 devices
[1] Train-accuracy=0.0975510204081633
# omitted #
[60] Train-accuracy=0.9822
> print(proc.time() - tic)
   user  system elapsed 
784.666   3.767 677.921 
> preds <- predict(model, test.array, ctx=devices)
> pred.label <- max.col(t(preds)) - 1
> table(test_org[,1],pred.label)
   pred.label
      0   1   2   3   4   5   6   7   8   9
  0  99   0   0   0   0   0   1   0   0   0
  1   0  99   0   0   1   0   0   0   0   0
  2   0   0  98   0   0   0   0   1   1   0
  3   0   0   0  98   0   1   0   0   1   0
  4   0   2   0   0  97   0   1   0   0   0
  5   0   0   0   0   0  99   1   0   0   0
  6   0   0   0   0   0   0 100   0   0   0
  7   0   0   0   0   0   0   0  99   1   0
  8   0   0   0   0   0   0   0   0 100   0
  9   0   0   0   0   2   0   0   0   0  98
> sum(diag(table(test_org[,1],pred.label)))/1000
[1] 0.987
# accuracy 98.7%
This is the so-called LeNet: ReLU chosen as the activation function, two convolutional layers, two fully connected layers, and a softmax output. At 98.7% it reaches an accuracy probably close to the ceiling for this sample size. That is CNNs for you.
Bayesian modeling with MCMC

As featured in volume 1 of the Iwanami Data Science series, on whose editorial board I serve, Bayesian modeling with MCMC is suited to handling models too complex for ordinary*11 linear regression models or generalized linear models.

By the way, practice used to be based on BUGS, but it is comparatively slow to run and, above all, its development has now stopped, so the latest theory is no longer reflected in the implementation. These days the fast implementations JAGS and, above all, Stan are easy to obtain, and more people seem to be learning those instead; this blog has also covered them in a past series of articles.

So here is a simple Stan example. Against a slightly modified version of the sample dataset used in a previous article, we estimate the parameters of a seasonal-adjustment plus second-order-difference trend model and obtain the fitted result. The picture to have in mind: we want to model the daily conversion count when three kinds of ads are served with daily-varying volumes. Because it is daily data, it presumably also contains something like day-of-week variation unrelated to the ads, and we model it accordingly. The Stan code and then the R code follow.
data {
  int<lower=0> N;
  real<lower=0> x1[N];
  real<lower=0> x2[N];
  real<lower=0> x3[N];
  real<lower=0> y[N];
}

parameters {
  real wk[N];
  real trend[N];
  real s_trend;
  real s_q;
  real s_wk;
  real<lower=0> a;
  real<lower=0> b;
  real<lower=0> c;
  real d;
}

model {
  real q[N];
  real cum_trend[N];
  for (i in 7:N)
    wk[i]~normal(-wk[i-1]-wk[i-2]-wk[i-3]-wk[i-4]-wk[i-5]-wk[i-6],s_wk); // period-7 seasonal adjustment (day-of-week variation)
  for (i in 3:N)
    trend[i]~normal(2*trend[i-1]-trend[i-2],s_trend); // second-order-difference trend
  cum_trend[1]<-trend[1];
  for (i in 2:N)
    cum_trend[i]<-cum_trend[i-1]+trend[i];
  for (i in 1:N)
    q[i]<-y[i]-wk[i]-cum_trend[i]; // decompose the target into regression, seasonal and trend parts
  for (i in 1:N)
    q[i]~normal(a*x1[i]+b*x2[i]+c*x3[i]+d,s_q); // sampling for the regression part
}
> d<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/DM_sampledata/example_bayesian_modeling.csv')
> dat<-list(N=nrow(d),y=d$cv,x1=d$ad1,x2=d$ad2,x3=d$ad3)
> library(rstan)
# turn on the parallelization options
> rstan_options(auto_write = TRUE)
> options(mc.cores = parallel::detectCores())
> fit<-stan(file='https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/DM_sampledata/hb_trend_cum_wk.stan',data=dat,iter=1000,chains=4)
starting worker pid=4813 on localhost:11406 at 00:03:29.822
starting worker pid=4821 on localhost:11406 at 00:03:30.007
starting worker pid=4829 on localhost:11406 at 00:03:30.188
starting worker pid=4837 on localhost:11406 at 00:03:30.370

SAMPLING FOR MODEL 'hb_trend_cum_wk' NOW (CHAIN 1).
Chain 1, Iteration:   1 / 1000 [  0%]  (Warmup)
SAMPLING FOR MODEL 'hb_trend_cum_wk' NOW (CHAIN 2).
Chain 2, Iteration:   1 / 1000 [  0%]  (Warmup)
SAMPLING FOR MODEL 'hb_trend_cum_wk' NOW (CHAIN 3).
Chain 3, Iteration:   1 / 1000 [  0%]  (Warmup)
SAMPLING FOR MODEL 'hb_trend_cum_wk' NOW (CHAIN 4).
Chain 4, Iteration:   1 / 1000 [  0%]  (Warmup)
# ... omitted ...
# Chain 3, Iteration: 1000 / 1000 [100%]  (Sampling)#
#  Elapsed Time: 34.838 seconds (Warm-up)
#                16.5852 seconds (Sampling)
#                51.4232 seconds (Total)
# Chain 4, Iteration: 1000 / 1000 [100%]  (Sampling)#
#  Elapsed Time: 42.5642 seconds (Warm-up)
#                46.8373 seconds (Sampling)
#                89.4015 seconds (Total)
# Chain 2, Iteration: 1000 / 1000 [100%]  (Sampling)#
#  Elapsed Time: 47.8614 seconds (Warm-up)
#                44.052 seconds (Sampling)
#                91.9134 seconds (Total)
# Chain 1, Iteration: 1000 / 1000 [100%]  (Sampling)#
#  Elapsed Time: 41.7805 seconds (Warm-up)
#                50.8883 seconds (Sampling)
#                92.6688 seconds (Total)

# below: the process of extracting the mode of each posterior distribution as the parameter estimate
> fit.smp<-extract(fit)
> dens_a<-density(fit.smp$a)
> dens_b<-density(fit.smp$b)
> dens_c<-density(fit.smp$c)
> dens_d<-density(fit.smp$d)
> a_est<-dens_a$x[dens_a$y==max(dens_a$y)]
> b_est<-dens_b$x[dens_b$y==max(dens_b$y)]
> c_est<-dens_c$x[dens_c$y==max(dens_c$y)]
> d_est<-dens_d$x[dens_d$y==max(dens_d$y)]
> trend_est<-rep(0,100)
> for (i in 1:100) {
+     tmp<-density(fit.smp$trend[,i])
+     trend_est[i]<-tmp$x[tmp$y==max(tmp$y)]
+ }
> week_est<-rep(0,100)
> for (i in 1:100) {
+     tmp<-density(fit.smp$wk[,i])
+     week_est[i]<-tmp$x[tmp$y==max(tmp$y)]
+ }
> pred<-a_est*d$ad1+b_est*d$ad2+c_est*d$ad3+d_est+cumsum(trend_est)+week_est
> matplot(cbind(d$cv,pred,d_est+cumsum(trend_est)),type='l',lty=1,lwd=c(2,3,2),col=c('black','red','#008000'),ylab="CV")
> legend("topleft",c("Data","Predicted","Trend"),col=c('black','red','#008000'),lty=c(1,1),lwd=c(2,3,2),cex=1.2)
It is not a perfect fit, of course, but for a result obtained with no prior knowledge whatsoever, just by putting the seasonal-adjustment and second-order-difference trend terms into the Stan code and then sampling to estimate the parameters, I think it is pretty decent.
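Stan's NUTS/HMC machinery is far more efficient, but the accept/reject logic at the heart of MCMC fits in a few lines. A toy sketch: a random-walk Metropolis sampler for the mean of a normal with known spread and a flat prior (all numbers here are made up):

```python
# Random-walk Metropolis for the posterior of a normal mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100)

def log_post(mu):                   # log-posterior up to a constant
    return -0.5 * np.sum((data - mu) ** 2)

samples, mu = [], 0.0
for _ in range(20_000):
    prop = mu + rng.normal(scale=0.5)            # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(mu):
        mu = prop                                # accept, else keep mu
    samples.append(mu)

post = np.array(samples[5_000:])                 # drop burn-in
print(post.mean(), data.mean())  # posterior mean ~ sample mean
```

With a flat prior the posterior here is centered on the sample mean with standard deviation 1/sqrt(n), which the chain recovers.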
Postscript: a vectorized version of the Stan script

The approach above is rather long-winded; doing it as below makes it a bit simpler, and it stays applicable even when the number of variables changes. Noting it down for the record.
data {
  int<lower=0> N;
  int<lower=0> M;
  matrix[N, M] X;
  real y[N];
}

parameters {
  real trend[N];
  real season[N];
  real s_trend;
  real s_q;
  real s_season;
  vector<lower=0>[M] beta;
  real d;
}

model {
  real q[N];
  real cum_trend[N];
  for (i in 7:N) {
    season[i]~normal(-season[i-1]-season[i-2]-season[i-3]-season[i-4]-season[i-5]-season[i-6],s_season);
  }
  for (i in 3:N)
    trend[i]~normal(2*trend[i-1]-trend[i-2],s_trend);
  cum_trend[1]<-trend[1];
  for (i in 2:N)
    cum_trend[i]<-cum_trend[i-1]+trend[i];
  for (i in 1:N)
    q[i]<-y[i]-cum_trend[i]-season[i];
  for (i in 1:N)
    q[i]~normal(dot_product(X[i], beta)+d,s_q);
}
Save this Stan script under the name 'v2.stan', and then kick it off as follows.
> d<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/DM_sampledata/example_bayesian_modeling.csv')
> dy<-d$cv
> dvar<-d[,-1]
> d.dat<-list(N=nrow(dvar), M=ncol(dvar), X=dvar, y=dy)
> library(rstan)
# turn on the parallelization options
> rstan_options(auto_write = TRUE)
> options(mc.cores = parallel::detectCores())
> fit <- stan(file='v2.stan', data=d.dat, iter=1000, chains=4)
> fit.smp<-extract(fit)
> t_d<-density(fit.smp$d)
> d_est<-t_d$x[t_d$y==max(t_d$y)]
> beta<-rep(0,ncol(dvar))
> for (i in 1:ncol(dvar)) {
+     tmp<-density(fit.smp$beta[(2000*(i-1)+1):(2000*i)])
+     beta[i]<-tmp$x[tmp$y==max(tmp$y)]
+ }
> trend<-rep(0,nrow(dvar))
> for (i in 1:nrow(dvar)) {
+     tmp<-density(fit.smp$trend[,i])
+     trend[i]<-tmp$x[tmp$y==max(tmp$y)]
+ }
> season<-rep(0,nrow(dvar))
> for (i in 1:nrow(dvar)) {
+     tmp<-density(fit.smp$season[,i])
+     season[i]<-tmp$x[tmp$y==max(tmp$y)]
+ }
> beta_prod<-rep(0,nrow(dvar))
> for (i in 1:ncol(dvar)){beta_prod<-beta_prod + dvar[,i]*beta[i]}
> pred <- d_est + beta_prod + cumsum(trend) + season
> matplot(cbind(dy,pred,d_est+cumsum(trend)),type='l',lty=1,lwd=c(2,3,2),col=c('black','red','#008000'),ylab="CV")
> legend("topleft",c("Data","Predicted","Trend"),col=c('black','red','#008000'),lty=c(1,1),lwd=c(2,3,2),cex=1.2)
This way it keeps working as-is even if the number of rows or variables changes, and above all the Stan computation itself gets faster.
word2vec
What has spread through the natural language processing world since 2013 is word2vec. As the name suggests, it re-expresses each word as a numeric vector, which makes it possible to list up similar words or to "add and subtract" word meanings. These vectors seem to be used frequently for preprocessing and feature engineering on all kinds of NLP data. I have also touched on it briefly on this blog before.
So let's give it a quick try here too. As a really quick-and-dirty example, I collected the idle-talk posts from among the articles on this blog that earned 500+ Hatena bookmarks (lol), bundled them up, and put the result in my GitHub repository, so please download it locally. We then run it through MeCab for word segmentation and feed it to word2vec. As for the word2vec implementation, the simplest route at the moment is gensim, which installs easily via easy_install on Python, so that is what we use here. See the blog post linked above for installation instructions.
First, segment the text into words with MeCab.
$ mecab -Owakati tjo_stories.txt -o tjo_stories_token.txt
Then execute the following in Python.
from gensim.models import word2vec

data = word2vec.Text8Corpus('tjo_stories_token.txt')
model = word2vec.Word2Vec(data, size = 100)

out = model.most_similar(positive=[u'統計', u'学'])
for x in out:
    print(x[0], x[1])

基礎 0.987444281578
分析 0.98454105854
人 0.982671976089
検定 0.982355296612
言う 0.981441140175
勉強 0.981229901314
一方 0.980490446091
...

out = model.most_similar(positive=[u'機械', u'学習'])
for x in out:
    print(x[0], x[1])

と 0.950019478798
ある程度 0.940009057522
場合 0.939836621284
学 0.933527469635
的 0.928611636162
理解 0.925901889801
分析 0.923837661743
でも 0.922164022923
...

out = model.most_similar(positive=[u'統計'], negative=[u'データ'])
for x in out:
    print(x[0], x[1])

特に 0.900418162346
でも 0.89433068037
まで 0.893202722073
また 0.877071022987
可能 0.875332713127
...
The 統計 + 学 ("statistics" + "-ology") query relates strongly to 基礎 (basics), 分析 (analysis), and 検定 (testing); 機械 + 学習 ("machine" + "learning") relates strongly to ある程度 (to some extent) and 理解 (understanding); and subtracting データ (data) from 統計 (statistics) yields something that makes no sense at all (sweat). As expected, at this scale the corpus is just too small. The Aozora Bunko example in my earlier blog post may well be more decent than this.
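The "adding and subtracting of word meanings" that word2vec enables is nothing more than vector arithmetic plus cosine similarity. A self-contained sketch of what most_similar computes, using a tiny hand-made "embedding" (all vector values are hypothetical, purely for illustration):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def most_similar(vocab, positive=(), negative=(), topn=3):
    """Mimic gensim's most_similar: sum the positive vectors, subtract the
    negative ones, then rank the remaining words by cosine similarity."""
    dim = len(next(iter(vocab.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vocab[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vocab[w])]
    skip = set(positive) | set(negative)
    ranked = sorted(((cosine(query, v), w) for w, v in vocab.items() if w not in skip),
                    reverse=True)
    return [(w, s) for s, w in ranked[:topn]]

# tiny made-up "embedding"; real word2vec learns these vectors from a corpus
vocab = {
    'statistics': [0.9, 0.1, 0.0],
    'testing':    [0.8, 0.2, 0.1],
    'data':       [0.5, 0.5, 0.0],
    'machine':    [0.1, 0.9, 0.2],
    'learning':   [0.2, 0.8, 0.1],
}
hits = most_similar(vocab, positive=['machine', 'learning'], topn=2)
```

Everything gensim adds on top — negative sampling, subsampling, the training itself — is about learning good vectors; the query side really is this simple.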
K-means clustering
Clustering could be called the poster child of unsupervised learning. The techniques themselves are countless, with library implementations in all sorts of languages, so almost anything would do here; but let's pick K-means clustering, which is probably the most common and the most convenient*12. This too I have covered in an earlier post.
I'll skip the methodological details; in short, it is a technique that assigns the data points to k clusters, fixed in advance, based on the properties of the data themselves. For example, with the product-purchase simulation dataset from chapter 5 of my book, you can practice it like this.
> d<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/DM_sampledata/ch5_3.txt',header=T,sep=' ')
> d.km<-kmeans(d,centers=3)
> d1<-cbind(d,d.km$cluster)
> names(d1)[6]<-'cluster'
# summarize the clustering result
> res<-with(d1,aggregate(d1[,-6],list(cluster=cluster),mean))
> res
  cluster     books   cloths cosmetics    foods  liquors
1       1  9.047619 13.57143  5.285714 4.333333 7.571429
2       2 46.060606 11.36364  4.575758 5.090909 5.242424
3       3 28.739130 10.28261  4.478261 5.043478 6.043478
> barplot(as.matrix(res[order(res$books),-6]),col=rainbow(5))
It's an extremely rough visualization, but you should be able to see how the three clusters each contain the products of the different categories in different amounts. Of course, with more complex data you can split things into clusters with even finer-grained characteristics.
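For reference, the algorithm kmeans() runs under the hood (Lloyd's algorithm) fits in a few lines: alternate between assigning each point to its nearest center and moving each center to the mean of its members. A minimal sketch on made-up 2-D data:

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the nearest center (squared Euclidean distance)
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:  # move each center to the mean of its members
                centers[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
              for p in points]
    return centers, labels

# two obvious blobs (made-up data), clustered with k = 2
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(pts, 2)
```

Real implementations add smarter initialization (k-means++ and the like) and convergence checks, but the loop itself is exactly this.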
Incidentally, a practical "two-stage" recipe used in the field goes like this: first assign users to a suitable number of clusters with K-means, then use those cluster assignments as training labels to fit a machine-learning classifier, and finally use that classifier to sort new users into the individual clusters. This is a useful approach in cases where domain knowledge alone cannot classify the users, documents, or whatever you are dealing with.
Graph theory / network analysis
This one is nothing exotic either, but recently it has become clear that many industries and domains hold plenty of data representable as edge lists or as Markov chains, so it is a set of techniques I have personally been incorporating quite actively. I also wrote a series of posts on it at the end of last year, so I'll skip the methodological details.
As an easy digest, let's take the famous "Karate" dataset*13 and apply a few of what seem to be the standard graph-theory / network-analysis techniques. What we do is simple: with the {igraph} package we compute betweenness centrality to quantify which individuals play "hub"-like roles (i.e., are the linchpins of the social network), then draw the graph with the Fruchterman-Reingold algorithm so that closely related individuals are placed near each other; with the {linkcomm} package we estimate the community (friend-group) assignments under the assumption that one person can belong to multiple communities, and draw that likewise.
> library(igraph)
> library(linkcomm)
> g<-graph.edgelist(as.matrix(karate),directed=F)
> g
IGRAPH U--- 34 78 --
+ edges:
 [1]  1-- 2  1-- 3  2-- 3  1-- 4  2-- 4  3-- 4  1-- 5  1-- 6  1-- 7  5-- 7  6-- 7  1-- 8  2-- 8  3-- 8  4-- 8
[16]  1-- 9  3-- 9  3--10  1--11  5--11  6--11  1--12  1--13  4--13  1--14  2--14  3--14  4--14  6--17  7--17
[31]  1--18  2--18  1--20  2--20  1--22  2--22 24--26 25--26  3--28 24--28 25--28  3--29 24--30 27--30  2--31
[46]  9--31  1--32 25--32 26--32 29--32  3--33  9--33 15--33 16--33 19--33 21--33 23--33 24--33 30--33 31--33
[61] 32--33  9--34 10--34 14--34 15--34 16--34 19--34 20--34 21--34 23--34 24--34 27--34 28--34 29--34 30--34
[76] 31--34 32--34 33--34
# compute betweenness centrality
> g.bw<-betweenness(g)
# estimate the OCG clusters
> g.ocg<-getOCG.clusters(karate)
> par(mfrow=c(1,2))
> plot(g,vertex.size=g.bw/10,layout=layout.fruchterman.reingold)
> plot(g.ocg,type='graph',layout=layout.fruchterman.reingold)
You can see that #1 and #34 each act as the "boss" of one of the two big groups, that #33 is #34's faithful right hand, and that #3 and #32 stand in "go-between" positions bridging the two big groups.
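Betweenness centrality itself is simple to state: for every pair of nodes, count what fraction of the shortest paths between them pass through each other node. A brute-force sketch on a made-up toy graph (not the karate data; fine only for small graphs — real libraries such as igraph use much faster algorithms like Brandes'):

```python
from collections import deque
from itertools import combinations

def betweenness(adj):
    """Brute-force betweenness centrality for a small undirected graph:
    enumerate all shortest s-t paths via BFS and credit the interior nodes."""
    nodes = sorted(adj)
    bc = {v: 0.0 for v in nodes}
    for s, t in combinations(nodes, 2):
        paths, best = [], None
        queue = deque([(s, [s])])
        while queue:
            v, path = queue.popleft()
            if best is not None and len(path) > best:
                continue  # BFS order guarantees no shorter path remains
            if v == t:
                best = len(path)
                paths.append(path)
                continue
            for w in adj[v]:
                if w not in path:
                    queue.append((w, path + [w]))
        if not paths:
            continue
        for v in nodes:
            if v in (s, t):
                continue
            through = sum(1 for p in paths if v in p)
            bc[v] += through / len(paths)
    return bc

# toy graph: two triangles bridged by node 'c' (hypothetical data)
adj = {
    'a': {'b', 'c'}, 'b': {'a', 'c'},
    'c': {'a', 'b', 'd', 'e'},
    'd': {'c', 'e'}, 'e': {'c', 'd'},
}
bc = betweenness(adj)
```

Here 'c' is the only bridge between the two triangles, so it collects all the credit — the same reason #3 and #32 stand out in the karate plot.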
Other useful techniques
The reason I bolted these on is, as mentioned at the beginning, that they are two techniques widely used across the industry which I myself have not yet practiced hands-on for my own needs. Accordingly, the explanations are not as practice-oriented as for the ten techniques above (they are closer to pointers to tutorials), for which I apologize in advance.
LDA and topic models
At my previous workplace the colleagues around me were doing this, so although I have never practiced it myself, LDA (Latent Dirichlet Allocation) is one of the techniques I feel oddly familiar with. It is the representative technique of topic models. Roughly speaking, a topic model is "a model that expresses, as probabilities per topic, which words appear and how often in which kinds of documents," and it is often used for things like document classification. The "words" can also be swapped for some other kind of data entirely; in industry you occasionally hear case studies that make you go "you apply a topic model to THAT?!"
That said, when people talk about LDA they usually have a production system in mind, so implementations in Python and the like are inevitably easier to come by; the gensim implementation that appeared in the word2vec section is a famous one. Still, if you just want to try it locally, R also has the {lda} package that makes a mock run easy — so let's borrow the code wholesale from id:MikuHatsune's post and run it (lol).
Note that I changed the dataset used to 'newsgroups'. Like the 'cora' dataset used in that post, it comes bundled with the {lda} package: a dataset of 20,000 newsgroup*14 articles collected over 20 categories.
> library(lda)
> data(newsgroup.train.documents)
> data(newsgroup.vocab)
# take a quick head of the first article's word frequencies
> head(as.data.frame(cbind(newsgroup.vocab[newsgroup.train.documents[[1]][1, ]+1],newsgroup.train.documents[[1]][2, ])),n=10)
          V1 V2
1    archive  4
2       name  2
3    atheism 10
4  resources  4
5        alt  2
6       last  1
7   modified  1
8   december  1
9    version  3
10   atheist  9
# estimate the topic model
> K <- 20
> result <- lda.collapsed.gibbs.sampler(newsgroup.train.documents, K, newsgroup.vocab, 25, 0.1, 0.1, compute.log.likelihood=TRUE)
# show the top 3 and top 20 words making up each topic
> top.words3 <- top.topic.words(result$topics, 3, by.score=TRUE)
> top.words20 <- top.topic.words(result$topics, 20, by.score=TRUE)
> top.words3
     [,1]   [,2]    [,3]    [,4]    [,5]   [,6]   [,7]     [,8]  [,9]     [,10]     [,11]  [,12]      [,13]
[1,] "he"   "god"   "space" "drive" "that" "that" "window" "it"  "that"   "com"     "he"   "windows"  "the"
[2,] "they" "that"  "the"   "scsi"  "it"   "was"  "file"   "car" "israel" "medical" "team" "software" "of"
[3,] "was"  "jesus" "of"    "mb"    "mr"   "of"   "db"     "you" "not"    "was"     "game" "graphics" "and"
     [,14]        [,15]  [,16]  [,17]  [,18]    [,19]   [,20]
[1,] "key"        "god"  "you"  "you"  "edu"    "space" "that"
[2,] "encryption" "that" "that" "it"   "com"    "nasa"  "we"
[3,] "chip"       "of"   "gun"  "they" "writes" "for"   "you"
> top.words20
      [,1]        [,2]        [,3]         [,4]         [,5]             [,6]      [,7]          [,8]
 [1,] "he"        "god"       "space"      "drive"      "that"           "that"    "window"      "it"
 [2,] "they"      "that"      "the"        "scsi"       "it"             "was"     "file"        "car"
 [3,] "was"       "jesus"     "of"         "mb"         "mr"             "of"      "db"          "you"
 [4,] "were"      "he"        "nasa"       "card"       "stephanopoulos" "not"     "server"      "my"
 [5,] "she"       "the"       "and"        "disk"       "president"      "it"      "motif"       "that"
 [6,] "had"       "of"        "launch"     "output"     "you"            "as"      "widget"      "use"
 [7,] "and"       "is"        "satellite"  "file"       "we"             "you"     "mit"         "or"
 [8,] "we"        "not"       "center"     "controller" "he"             "writes"  "sun"         "com"
 [9,] "that"      "we"        "orbit"      "entry"      "government"     "drugs"   "uk"          "driver"
[10,] "her"       "his"       "lunar"      "drives"     "is"             "are"     "com"         "engine"
[11,] "armenians" "you"       "in"         "ide"        "not"            "were"    "edu"         "cars"
[12,] "his"       "bible"     "gov"        "mhz"        "this"           "article" "display"     "if"
[13,] "turkish"   "people"    "earth"      "bus"        "jobs"           "who"     "set"         "wiring"
[14,] "the"       "christian" "by"         "system"     "what"           "in"      "application" "get"
[15,] "there"     "armenian"  "to"         "memory"     "com"            "about"   "windows"     "but"
[16,] "said"      "it"        "spacecraft" "mac"        "clinton"        "sex"     "code"        "oil"
[17,] "armenian"  "who"       "mission"    "dos"        "if"             "greek"   "lib"         "me"
[18,] "him"       "christ"    "south"      "if"         "believe"        "people"  "cs"          "can"
[19,] "me"        "turkish"   "on"         "windows"    "think"          "is"      "tar"         "up"
[20,] "went"      "are"       "mars"       "pc"         "people"         "livesey" "xterm"       "speed"
      [,9]     [,10]        [,11]      [,12]         [,13]        [,14]        [,15]        [,16]
 [1,] "that"   "com"        "he"       "windows"     "the"        "key"        "god"        "you"
 [2,] "israel" "medical"    "team"     "software"    "of"         "encryption" "that"       "that"
 [3,] "not"    "was"        "game"     "graphics"    "and"        "chip"       "of"         "gun"
 [4,] "you"    "in"         "season"   "image"       "government" "clipper"    "is"         "the"
 [5,] "israeli" "disease"   "games"    "dos"         "israel"     "keys"       "it"         "he"
 [6,] "to"     "msg"        "hockey"   "ftp"         "jews"       "to"         "not"        "guns"
 [7,] "of"     "aids"       "play"     "files"       "militia"    "government" "we"         "we"
 [8,] "is"     "patients"   "players"  "version"     "states"     "privacy"    "to"         "they"
 [9,] "people" "article"    "year"     "file"        "their"      "security"   "you"        "not"
[10,] "jews"   "writes"     "his"      "available"   "by"         "escrow"     "believe"    "have"
[11,] "who"    "use"        "league"   "pub"         "united"     "secure"     "jesus"      "is"
[12,] "are"    "food"       "teams"    "mail"        "state"      "nsa"        "evidence"   "if"
[13,] "their"  "it"         "baseball" "information" "congress"   "des"        "he"         "your"
[14,] "they"   "apr"        "win"      "pc"          "turkey"     "law"        "there"      "people"
[15,] "war"    "hiv"        "nhl"      "system"      "amendment"  "the"        "christians" "weapons"
[16,] "it"     "university" "player"   "anonymous"   "um"         "system"     "his"        "it"
[17,] "we"     "edu"        "was"      "for"         "law"        "will"       "christian"  "are"
[18,] "arab"   "had"        "vs"       "mac"         "mr"         "public"     "they"       "of"
[19,] "human"  "health"     "pts"      "data"        "arms"       "of"         "do"         "was"
[20,] "were"   "my"         "division" "internet"    "turks"      "pgp"        "atheism"    "do"
      [,17]     [,18]    [,19]      [,20]
 [1,] "you"     "edu"    "space"    "that"
 [2,] "it"      "com"    "nasa"     "we"
 [3,] "they"    "writes" "for"      "you"
 [4,] "is"      "article" "program" "is"
 [5,] "that"    "it"     "edu"      "to"
 [6,] "writes"  "my"     "the"      "it"
 [7,] "edu"     "apr"    "email"    "they"
 [8,] "don"     "you"    "flight"   "do"
 [9,] "uiuc"    "bike"   "system"   "not"
[10,] "think"   "car"    "moon"     "my"
[11,] "your"    "cs"     "henry"    "he"
[12,] "article" "dod"    "research" "have"
[13,] "not"     "ca"     "engine"   "people"
[14,] "my"      "pitt"   "model"    "what"
[15,] "com"     "cars"   "sale"     "church"
[16,] "would"   "too"    "send"     "be"
[17,] "me"      "like"   "shuttle"  "think"
[18,] "if"      "ride"   "you"      "can"
[19,] "do"      "ac"     "entries"  "there"
[20,] "cso"     "uucp"   "looking"  "if"
# check just the first 5 articles
> N <- 5
> topic.proportions <- t(result$document_sums) / colSums(result$document_sums)
> topic.proportions <- topic.proportions[1:N, ]
> topic.proportions[is.na(topic.proportions)] <- 1 / K
> colnames(topic.proportions) <- apply(top.words3, 2, paste, collapse=" ")
> par(mar=c(5, 14, 2, 2))
> barplot(topic.proportions, beside=TRUE, horiz=TRUE, las=1, xlab="proportion")
This plots the topic probabilities of the first five news articles, and you can see various things — for instance, that some topics appear only in particular articles. Incidentally, the 'newsgroups' dataset comes split into train / test, so you can also build the model on train and then try it out on test.
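For intuition about what lda.collapsed.gibbs.sampler is doing, here is a toy collapsed Gibbs sampler for LDA in pure Python, run on a made-up corpus of word ids (the corpus and hyperparameters are arbitrary; for real work use {lda} or gensim as above):

```python
import random

def lda_gibbs(docs, K, vocab_size, iters=200, alpha=0.1, beta=0.1, seed=7):
    """Toy collapsed Gibbs sampling for LDA. docs: lists of word ids.
    Returns doc-topic and topic-word count tables."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # topic of each token
    ndk = [[0] * K for _ in docs]                          # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(K)]             # topic-word counts
    nk = [0] * K                                           # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the token, then resample its topic from the
                # collapsed conditional p(k) ∝ (ndk+α)(nkw+β)/(nk+Vβ)
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + vocab_size * beta)
                           for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# two fake "themes": word ids 0-2 vs 3-5 (made-up corpus)
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 0], [3, 4, 5, 3, 4], [4, 5, 3, 5]]
ndk, nkw = lda_gibbs(docs, K=2, vocab_size=6)
```

The count tables ndk and nkw are exactly what document_sums and topics in the {lda} output summarize; normalizing ndk rows gives the per-document topic proportions plotted above.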
Matrix factorization (SVD, NMF, etc.)
This is another one that I have actually never done in practice myself (cold sweat), but team members at my previous workplace did it — and it was frequently taken up as an important theme in our paper-reading circle — so it is among the techniques I know very well. I think part of why it gets picked up so eagerly is that the family of matrix-factorization techniques underlies modern recommendation technology.
In essence it is just dimensionality reduction — pruning the unneeded elements from the original data, nothing more — but this brings benefits such as making recommendation computations tractable even on sparse data. Writing all of that out here would be a big job (my own study of it is insufficient), so instead let me cite the blog posts of two people I am always indebted to, id:SAM and id:a_bicky.
So let's try doing the same against the purchase-simulation data I have put up in my GitHub repository. To be blunt, this just free-rides on those two gentlemen's worked examples, so there is nothing original in it whatsoever (sweat).
# purchase simulation data from chapter 9 of my book
> M<-read.csv('https://github.com/ozt-ca/tjo.hatenablog.samples/raw/master/r_samples/public_lib/DM_sampledata/ch9_2.txt',header=T,sep=' ')
# example: truncate at rank 4 with SVD
> res.svd <- svd(M)  # SVD
> u <- res.svd$u
> v <- res.svd$v
> d <- diag(res.svd$d)
> d_r <- d
> for (i in 5:11) {
+     d_r[i, i] <- 0
+ }
> R_svd <- as.matrix(M) %*% v %*% solve(d) %*% d_r %*% t(v)
> colnames(R_svd) <- colnames(M)
# resulting recommendation scores
> head(round(R_svd, 2))
     book cosmetics electronics food imported liquor magazine  sake stationery  toy travel
[1,] 0.74      1.21        0.15 0.36     0.24   1.00     0.82  0.14       0.42 1.07   0.19
[2,] 0.70      0.07        0.27 0.64     0.91  -0.02    -0.06 -0.03       0.63 0.02   0.23
[3,] 1.12      0.71        0.34 0.73     0.14   0.95     0.40  1.04       0.72 1.04   0.38
[4,] 1.40      0.68        0.45 1.01     0.72   0.92     0.33  0.79       1.00 1.02   0.47
[5,] 0.81      0.06        0.27 0.52    -0.03   0.96    -0.05  0.99       0.51 1.02   0.30
[6,] 1.13      1.17        0.35 0.87     0.20  -0.05     0.76  1.12       0.78 0.03   0.40
# example: truncate at rank 4 with NMF
> library(NMF)
> res.nmf <- nmf(M, 4, seed=1234)  # NMF
> w <- basis(res.nmf)
> h <- coef(res.nmf)
> h_z <- rbind(h, rep(0, 11))
> R_nmf <- w %*% h
# resulting recommendation scores
> head(round(R_nmf, 2))
     book cosmetics electronics food imported liquor magazine sake stationery  toy travel
[1,] 0.81      1.36        0.00 0.52     0.00   0.97     0.68 0.00       0.54 1.13   0.00
[2,] 0.64      0.00        0.00 0.48     0.88   0.00     0.00 0.00       0.47 0.00   0.52
[3,] 1.07      0.75        0.47 0.70     0.00   1.00     0.37 0.76       0.71 1.17   0.00
[4,] 1.57      0.61        0.33 1.10     1.02   0.85     0.30 0.53       1.10 0.99   0.61
[5,] 0.78      0.00        0.44 0.50     0.00   0.95     0.00 0.70       0.52 1.11   0.00
[6,] 1.19      1.38        0.81 0.85     0.00   0.00     0.68 1.30       0.79 0.00   0.00
You should see that the recommendation results for the first six users differ subtly between SVD and NMF (the former contains negative values, the latter only non-negative ones). Needless to say, the techniques used for recommendation in real production settings are far more complex, since they must also account for practical constraints and computational load.
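To make the NMF side concrete, here is a pure-Python sketch of one standard way to compute it, Lee & Seung's multiplicative updates (the toy matrix is made up; the {NMF} package does this, and much more, for you). Because each update only multiplies nonnegative quantities, W and H stay nonnegative throughout — exactly why the NMF scores above contain no negative values, unlike the SVD ones:

```python
import random

def nmf(V, r, iters=200, seed=0):
    """Multiplicative-update NMF (Frobenius loss): factor the nonnegative
    n x m matrix V into W (n x r) times H (r x m), all entries >= 0."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]
    def T(A):
        return [list(row) for row in zip(*A)]
    eps = 1e-9
    for _ in range(iters):
        WH = matmul(W, H)
        num, den = matmul(T(W), V), matmul(T(W), WH)   # H <- H * (W'V)/(W'WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(r)]
        WH = matmul(W, H)
        num, den = matmul(V, T(H)), matmul(WH, T(H))   # W <- W * (VH')/(WHH')
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)] for i in range(n)]
    return W, H

# small made-up nonnegative "purchase" matrix, factored at rank 2
V = [[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [1, 0, 0, 4], [0, 1, 5, 4]]
W, H = nmf(V, 2)
R = [[sum(W[i][k] * H[k][j] for k in range(2)) for j in range(4)] for i in range(5)]
```

R here plays the role of R_nmf above: the low-rank reconstruction whose entries serve as recommendation scores.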
Points to check when learning the various techniques of statistics and machine learning
As I wrote in an earlier post, I personally distinguish statistics and machine learning as follows:
- Statistics is for "explaining"
- Machine learning is for "predicting"
Basically, when a regression model is used from a statistical standpoint, the emphasis usually falls on "explanatory" elements such as the size of the parameters. From a machine-learning standpoint, on the other hand, the emphasis usually falls on "prediction" for data the model has not seen (test data).
Viewed this way, the metrics for evaluating model performance naturally differ too. From the statistical standpoint, models are often evaluated with quantitative criteria such as AIC; from the machine-learning standpoint, they are mostly evaluated by their performance under cross-validation, with particular weight on how high the generalization performance is. Let's keep in mind that what matters is which of "explaining" and "predicting" you put the weight on.
What affects model performance is another important point. In general, for models with a strong statistical flavor (linear regression models, generalized linear models, and so on) the selection of explanatory variables affects model performance, while for models with a strong machine-learning flavor (random forests, Deep Learning, and so on) the choice of the model's structural parameters*15 has a large additional influence. For the latter, you therefore sometimes need to find the best parameter tuning by brute-force methods such as grid search.
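The grid search + cross-validation combination just mentioned can be sketched in a few lines. Here a single hypothetical penalty parameter of a closed-form 1-D ridge regression is tuned by 5-fold CV (data, grid, and model are all made up for illustration):

```python
import random

def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal (strided) folds."""
    idx = list(range(n))
    return [idx[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept:
    beta = sum(x*y) / (sum(x^2) + lambda)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5):
    """Mean squared held-out error over k folds for a given penalty."""
    folds = kfold_indices(len(xs), k)
    err = 0.0
    for f in folds:
        train = [i for i in range(len(xs)) if i not in f]
        beta = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        err += sum((ys[i] - beta * xs[i]) ** 2 for i in f)
    return err / len(xs)

random.seed(0)
xs = [i / 10 for i in range(40)]
ys = [2.0 * x + random.gauss(0, 0.1) for x in xs]   # true slope = 2, small noise
grid = [0.0, 0.1, 1.0, 10.0, 100.0]                 # hypothetical penalty grid
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))
```

Grid search over real models (random forests, Deep Learning) works the same way — only the "fit" step and the parameter grid change; the CV loop around it is identical.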
In this way, depending on which technique you choose, not only the model's performance but even the way you should approach the work differs — a point that needs particular care when you use these methods in real business settings. Conversely, if you get this part right, you can usually obtain practically useful results without reaching for especially difficult techniques (speaking from personal experience).
In closing
By the way, a few methods were dropped from the previous "10 selections"; the representative ones are decision trees and SVM. The former is still highly regarded as a weak learner, but seen as a standalone model its accuracy is far behind modern classifiers, interpreting its results for ad hoc analysis is harder than you would expect, and on top of that the handy {mvpart} package was removed from CRAN, making installation a pain — so it's out this time.
As for SVM: even among the crowd of other classifiers, its merits beyond "excellent generalization performance" are no longer that large, and it is rather the kind of method where you learn the most by studying the derivation and implementation of its internal algorithm; so it does not really fit a "first, learn how to use it" article like this one, and I left it out.
Beyond that: association analysis is out because opportunities to use it in the field are themselves decreasing (for recommendation, the SVD/NMF family is stronger); and econometric time series analysis is out because in business practice the datasets you handle are mostly driven by exogenous variables, such cases are comparatively rare (granted, it is heavily used in other fields such as finance), and the bayesian modeling side can cover them.
The books I find essential for learning the 12 techniques raised this time in more depth are collected in the past post below, so take a look there if you are interested.
Of course, those 5 + 12 books alone are not nearly enough to understand all 12 techniques above thoroughly, so please go on to more systematically organized books as appropriate, and get hands-on yourself. For that matter, I myself need to keep digging up and reading good texts beyond those 5 + 12, and study a lot more.
So, that was the "10+2 selections," revised in light of three years of progress across the whole field. I suppose the next revision will come another three years from now!? (lol)
Postscript
I wrote up an English version.
*1: Some, like Stan and xgboost, indirectly require a gcc / clang compiler, and some, like H2O, indirectly require Java.
*2: The framework of hypothesis testing is very convenient, but there are in fact complicated debates about whether, even in this case, it is valid to conclude that "DB1 is the faster one" (especially once effect sizes, sample sizes, and things like the file drawer problem come into play). Still, from the standpoint of a "user" of statistics, treating "a significant difference came out" as the conclusion is fine in most cases.
*3: Frankly the example is iffy, but I couldn't come up with anything better. Sorry, sorry, sorry.
*4: When you properly implement machine learning systems on the engineering side, these are techniques you end up using far more often than their apparent "simplicity" would suggest.
*5: In other words, be careful that this is count data, not continuous data.
*6: Concretely, in such cases you need to build the total number of site visitors into the model as an offset term.
*7: There are various methodologies for this: holdout, leave-one-out, k-fold, and so on.
*8: Doing set.seed(71) might bring you luck (lol)
*9: On the English version of this blog I tried it with everything identical except the H2O version, and a lower accuracy came out. A version upgrade probably changed something internal, I suppose.
*10: In short, the famous ReLU.
*11: In the sense of "the range covered by packages and libraries that are easy to get hold of in R or Python."
*12: I'll skip hierarchical clustering (Ward's method and the like) and topics such as the EM algorithm (i.e., mixture models).
*13: A record, as an undirected graph, of the personal relationships in a karate club somewhere.
*14: The same newsgroups that once included fj and the like, right(?)
*15: For Deep Learning, things like the number of hidden layers and the number of units in each layer.