Addendum
In March 2016 the content of this post was updated in a newer article. Please read that one from now on.
This is mostly a summary for my own benefit (lol), but I would like to introduce ten statistical and machine-learning methods, limited to those for which I actually use tools, libraries, or packages in my day-to-day web data analysis and data science work as of June 2013.
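- Addendum
- Regression analysis (especially linear multiple regression)
- Tests of independence (chi-squared test, Fisher's exact test)
- Principal component analysis (PCA) / factor analysis
- Clustering
- Decision trees / regression trees
- Support vector machines (SVM)
- Logistic regression
- Random forests
- Association analysis (basket analysis, association rule mining)
- Econometric time-series analysis
- Closing remarks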
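Basically, all of these are methods for the ad hoc analyses I carry out as part of my job in a strategic marketing department. So things like
- methods I know about but hardly ever use in practice
- methods aimed at back-end systems, such as recommendation
- ways of implementing machine-learning methods for back-end systems
- things like deep learning that are famous out there but that I have not yet used in practice myself
- things I am simply not good at, such as Bayesian methods
have been left out. My apologies in advance. I plan to keep updating this series whenever the content of my job changes... probably (lol).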
As before, I will set aside a fair amount of statistical and machine-learning rigor and stick to fairly rough explanations. The finer points can wait for another time. Everything here can mostly be done in R or SPSS. If you are new to R, please set up an environment first, referring to an earlier post such as "Bring in a table of feature vectors + class labels, then classify it the easy way with machine learning in R", and then give it a try.
(Note: this post focuses only on which methods I use and which tools, libraries, or packages let you use them, so anything that touches on rigor is ignored entirely. Sorry about that.)
Regression analysis (especially linear multiple regression)
Put simply and roughly, you posit a numeric model such as
Revenue = a * heavy-user DAU + b * light-user DAU + c * returning-user DAU
and then estimate the coefficients a, b, and c back from actual data to obtain the overall shape of the model. It suits modeling quantities such as DAU or revenue, that is, numbers you would expect to get by adding or multiplying other things together.
In R you can do it like this. The sample data is airquality. When plotted, the data contains gaps*1, and the goal is to find a model that explains the ozone concentration shown by the black points.
    > data(airquality)               # load the data
    > airq <- airquality[, 1:4]      # drop the month and day columns
    > airq.lm <- lm(Ozone ~ ., airq) # fit Ozone = a * Solar.R + b * Wind + c * Temp + d
    > summary(airq.lm)               # show the result

    Call:
    lm(formula = Ozone ~ ., data = airq)

    Residuals:
        Min      1Q  Median      3Q     Max
    -40.485 -14.219  -3.551  10.097  95.619

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) -64.34208   23.05472  -2.791  0.00623 **
    Solar.R       0.05982    0.02319   2.580  0.01124 *
    Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
    Temp          1.65209    0.25353   6.516 2.42e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 21.18 on 107 degrees of freedom
      (42 observations deleted due to missingness)
    Multiple R-squared:  0.6059,  Adjusted R-squared:  0.5948
    F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

    > airq.lm <- lm(Ozone ~ . - 1, airq) # drop the intercept d
    > summary(airq.lm)

    Call:
    lm(formula = Ozone ~ . - 1, data = airq)

    Residuals:
        Min      1Q  Median      3Q     Max
    -40.675 -15.446  -5.526  13.479  88.822

    Coefficients:
            Estimate Std. Error t value Pr(>|t|)
    Solar.R  0.06306    0.02387   2.641  0.00948 **
    Wind    -4.59884    0.48653  -9.452 8.21e-16 ***
    Temp     0.98525    0.08739  11.275  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 21.84 on 108 degrees of freedom
      (42 observations deleted due to missingness)
    Multiple R-squared:  0.8383,  Adjusted R-squared:  0.8338
    F-statistic: 186.7 on 3 and 108 DF,  p-value: < 2.2e-16
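We can see that temperature has a positive effect on ozone concentration and wind speed a negative one. There are plenty of finer points, such as interactions and model selection*2, but I will skip them here. I did not do it this time, but you can also make predictions from the model with the predict() function.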
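Since predict() is mentioned but not shown, here is a minimal sketch of what it could look like with the same airquality model. The new observations in `new.obs` are made-up values for illustration only, not data from this post.

```r
# Refit the model from the transcript above (lm() drops rows with NA by default)
airq <- airquality[, 1:4]
airq.lm <- lm(Ozone ~ ., data = airq)

# Hypothetical new observations (values invented for illustration)
new.obs <- data.frame(Solar.R = c(150, 250),
                      Wind    = c(12, 6),
                      Temp    = c(70, 85))

# Point predictions plus 95% prediction intervals
pred <- predict(airq.lm, newdata = new.obs,
                interval = "prediction", level = 0.95)
print(pred)
```

The second row is sunnier, calmer, and hotter, so given the signs of the fitted coefficients it should come out with the higher predicted ozone level.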
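If you use R, it is worth getting used to the formula notation that shows up in regression analysis, such as
    y ~ x1 + x2 + x3 + ...   # regression model
    y ~ .                    # regression model (all remaining variables)
Regression models and formula notation are also used in other important linear testing models (analysis of variance, for example), so learning them will not hurt.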
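Tests of independence (chi-squared test, Fisher's exact test)
These are essential when you verify the effect of some measure on a KPI. In particular, when you want to know whether an improvement in an A/B test really increased the number of converting UUs, looking at CVR alone is often not enough, because the denominators frequently differ and the rates cannot be compared as they are. Knowing this methodology is therefore important.
(See also my earlier blog post: "'After our improvement the conversion rate went up from XX% to YY%!' may not hold once you stop ignoring the denominator")
In R you can do this with the chisq.test() and fisher.test() functions. As a sample, I prepared the long-familiar "does vaccination have an effect or not" data.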
 | Did not get sick | Got sick |
---|---|---|
Vaccinated | 1625 | 5 |
Not vaccinated | 1022 | 11 |
    > x <- matrix(c(1625, 5, 1022, 11), ncol = 2, byrow = T)  # supply the data as a matrix
    > print(x)  # check
         [,1] [,2]
    [1,] 1625    5
    [2,] 1022   11
    > chisq.test(x)  # chi-squared test

        Pearson's Chi-squared test with Yates' continuity correction

    data:  x
    X-squared = 4.8817, df = 1, p-value = 0.02714  # significant: the vaccination had an effect

    > fisher.test(x)  # Fisher's exact test

        Fisher's Exact Test for Count Data

    data:  x
    p-value = 0.01885  # significant: the vaccination had an effect
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
      1.115982 12.879160
    sample estimates:
    odds ratio
      3.496373
The conclusion is that the vaccination had an effect. Borderline cases like this one are quite common in A/B tests, so it is worth knowing about.
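To connect this to the A/B-test setting described above, here is a small sketch with made-up counts. The matrix `ab` and its numbers are hypothetical and chosen only to show that a visible CVR gap can still fail a 5% significance test once the denominators are taken into account.

```r
# Hypothetical A/B test counts (invented for illustration):
# rows = variant, columns = converted / not converted
ab <- matrix(c(120, 4880,
               150, 4850),
             ncol = 2, byrow = TRUE,
             dimnames = list(variant = c("A", "B"),
                             outcome = c("converted", "not_converted")))

# Conversion rate per variant: 2.4% for A, 3.0% for B
cvr <- ab[, "converted"] / rowSums(ab)
print(cvr)

res.chisq  <- chisq.test(ab)   # chi-squared test (Yates' correction by default for 2x2)
res.fisher <- fisher.test(ab)  # Fisher's exact test
print(res.chisq$p.value)
print(res.fisher$p.value)
```

With these particular numbers, B looks 25% better than A in relative terms, yet the chi-squared p-value stays above 0.05, which is exactly the kind of situation where looking at CVR alone would mislead.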
Principal component analysis (PCA) / factor analysis
These are useful when the data is messy and you want to narrow down, at least roughly, the directions along which it splits. The two are often lumped together. Broadly speaking,
- PCA condenses many variables into a few variables without assuming a model
- factor analysis groups many variables into common factors under an assumed model
That is the difference. Either way, both are very useful when you want to know along which directions the data is distributed overall.
As an example of PCA, let's use USArrests, a sample dataset that ships with R. The variable names are rather grim: it records, per 100,000 residents, the number of people arrested for major crimes in each of the 50 US states in 1973.
    > data(USArrests)
    > pc.cr <- princomp(USArrests, cor = T)  # princomp() performs principal component analysis
    > biplot(pc.cr)
It becomes all too clear which states are the dangerous ones*3.
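A biplot shows only the first two components, so it can help to check how much of the variance they actually carry. The sketch below uses prcomp() with `scale. = TRUE`, which is my substitution for the princomp(cor = T) call above; both standardize the variables, but they are different functions and signs of the loadings may differ.

```r
# PCA on standardized variables (analogous to princomp(..., cor = TRUE))
pc <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component
var.explained <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var.explained, 3))
print(round(cumsum(var.explained), 3))

# Loadings of the original variables on the first two components
print(round(pc$rotation[, 1:2], 3))
```

If the first two components account for most of the variance, the two-dimensional biplot is a reasonable summary of the data.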
䏿¹ãå ååæã®ä¾ã¨ãã¦ã¯こちらのページã®ãµã³ãã«ãæåãããã¨ã«ãã¾ãããã妿 ¡ã§å¦çã«ãã©ã®æç§ã好ãorå«ãï¼ãã5段éã§çãã¦ããã£ããã¼ã¿ã ããã§ãã
    > data <- read.csv("dataset_exploratoryFactorAnalysis.csv")
    > data.fac <- factanal(data, factors = 3, scores = "regression")  # factor analysis with factanal(), number of factors set to 3
    > biplot(data.fac$scores, data.fac$loadings)
You can see that subject preferences split into two groups. Incidentally, running PCA on the same data gives almost the same result.
Clustering
Roughly speaking, this is a way of grouping together data points whose combinations of values are similar. Picture "people who play game A and game B" vs. "people who play game C and game D": when you suspect that users can be grouped by the combination of services they use, clustering is the methodology for actually splitting them on a UU basis.
If the data is small, you can use hierarchical clustering. Using the "which subjects do you like or dislike" data from above as is, it looks like this.
    > data <- read.csv("dataset_exploratoryFactorAnalysis.csv")
    > data.d <- dist(data)        # Euclidean distances between individual data points
    > data.cls <- hclust(data.d)  # hclust() is the hierarchical clustering function
    > plot(data.cls)
Sorry, the figure title ran off the edge (lol). In a dendrogram like this you can see the groups separating according to which subjects are liked or disliked.
When the data is large, hclust() often cannot cope, so it is safer to use kmeans(), which performs k-means clustering. In that case you basically cannot draw a dendrogram; each data point simply receives an index showing which cluster it was assigned to.
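The post does not show kmeans() in action, so here is a minimal sketch. I use the built-in USArrests data instead of the CSV above (which is not bundled with R), and the choice of 3 clusters and `nstart = 20` is an arbitrary assumption for illustration.

```r
set.seed(1)                       # k-means starts from random centers

x  <- scale(USArrests)            # standardize so no variable dominates the distance
km <- kmeans(x, centers = 3, nstart = 20)

# Each row only receives a cluster index; there is no dendrogram
print(head(km$cluster))
print(table(km$cluster))

# Cluster centers in standardized units
print(round(km$centers, 2))
```

The `cluster` vector can be attached back to the original data frame with cbind() to profile each group.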
Decision trees / regression trees
For UU-based web data analysis, I think this is actually the most popular method. In short, given class labels such as "retained or churned the following month" plus feature vectors such as "which actions the user took during the period", it shows in the form of a tree what separated retention from churn (pulled a gacha, connected with other users, etc.)*4.
Probably because the way the results are displayed is intuitive and easy to understand, it is used in many web data analysis teams. I hear some have even fully automated it and packaged it in-house so that anyone can access it.
In R it goes as follows. Again this is not a very cheerful dataset: I use the data that ships with R in which the survival vs. death of the Titanic's passengers and crew is classified along with various attributes. Here I use the {mvpart} package.
    > library(mvpart)  # provides rpart(); the standalone {rpart} package has the same function
    > data(Titanic)
    > z <- data.frame(Titanic)
    > Titanic1 <- data.frame(Class = rep(z[, 1], z[, 5]), Sex = rep(z[, 2], z[, 5]),
    +                        Age = rep(z[, 3], z[, 5]), Survived = rep(z[, 4], z[, 5]))
    > Titanic1.rp <- rpart(Survived ~ ., Titanic1)
    > plot(Titanic1.rp, uniform = T, margin = 0.12)
    > text(Titanic1.rp, uniform = T, use.n = T, all = F)
For computational convenience the category names end up as a, b, c, so you need to look at the raw data to see what each one corresponds to. From this data we learn the grim fact that women or children, and passengers in the better cabin classes, were more likely to survive.
Besides {mvpart} there are many other R packages for decision trees / regression trees, such as {C50}*5, so it is worth trying several.
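Support vector machines (SVM)
This needs no introduction: a very famous machine-learning classifier that is prized for tasks such as spam detection. It is generally used in back-end systems such as spam filters, but thanks to its high generalization performance you can also, for example, use data on "whether last month's UUs were retained or churned this month" to predict "whether this month's UUs will be retained or churn next month".
R has several packages that implement SVM. First, the {e1071} package. It is the R port of LIBSVM, known as a popular SVM library in other languages such as C++ and Python, so it is the better choice if consistency with results computed in other languages matters. The sample data is iris, "Fisher's iris data", the absolute classic in R. Here it is classified into 3 labels.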
    > library(e1071)
    > data(iris)
    > attach(iris)
    >
    > ## classification mode
    > # default with factor response:
    > model <- svm(Species ~ ., data = iris)  # fit the SVM model; in R this form is fine too
    >
    > # alternatively the traditional interface:
    > x <- subset(iris, select = -Species)
    > y <- Species
    > model <- svm(x, y)  # this form mirrors the original LIBSVM interface
    >
    > print(model)  # model details

    Call:
    svm.default(x = x, y = y)

    Parameters:
       SVM-Type:  C-classification
     SVM-Kernel:  radial
           cost:  1
          gamma:  0.25

    Number of Support Vectors:  51

    > summary(model)

    Call:
    svm.default(x = x, y = y)

    Parameters:
       SVM-Type:  C-classification
     SVM-Kernel:  radial
           cost:  1
          gamma:  0.25

    Number of Support Vectors:  51

     ( 8 22 21 )

    Number of Classes:  3

    Levels:
     setosa versicolor virginica

    > # test with train data
    > pred <- predict(model, x)
    > # (same as:)
    > pred <- fitted(model)
    >
    > # Check accuracy:
    > table(pred, y)
                y
    pred         setosa versicolor virginica
      setosa         50          0         0
      versicolor      0         48         2
      virginica       0          2        48
    >
    > # compute decision values and probabilities:
    > pred <- predict(model, x, decision.values = TRUE)
    > attr(pred, "decision.values")[1:4,]
      setosa/versicolor setosa/virginica versicolor/virginica
    1          1.196203         1.091757            0.6708373
    2          1.064664         1.056185            0.8482323
    3          1.180892         1.074542            0.6438980
    4          1.110746         1.053012            0.6781059
    >
    > # visualize (classes by color, SV by crosses): finally, plot the result
    > plot(cmdscale(dist(iris[,-5])),
    +      col = as.integer(iris[,5]),
    +      pch = c("o","+")[1:150 %in% model$index + 1])
You can see that, based on the lengths of the various parts of the flowers, the data is cleanly separated into the three taxonomic species.
With the equally well-known {kernlab} package, it runs as follows. Unlike the example above, this one classifies into 2 labels.
    > library(kernlab)
    > data(iris)
    > attach(iris)
    > y <- factor(iris[51:150, 5])  # two remaining species; factor() drops the unused level
    > iris1 <- data.frame(iris[51:150, 3:4], y)
    > set.seed(0)
    > ir.ksvm <- ksvm(y ~ ., data = iris1)
    Using automatic sigma estimation (sigest) for RBF or laplace kernel
    > plot(ir.ksvm, data = iris1[, 1:2])
The plot() function in {kernlab} is more customized and may be easier to read. As something a bit unusual, {kernlab} also implements a method for classifying strings, which you can try like this.
    > data(reuters)
    > is(reuters)
    [1] "list"    "vector"  "input"   "listI"   "lpinput" "output"
    > tsv <- ksvm(reuters, rlabels, kernel = "stringdot",
    +             kpar = list(length = 5), cross = 3, C = 10)
    > tsv
    Support Vector Machine object of class "ksvm"

    SV type: C-svc  (classification)
     parameter : cost C = 10

    String kernel function.  Type =  spectrum
     Hyperparameters : sub-sequence/string length =  5
     Normalized

    Number of Support Vectors : 39

    Objective Function Value : -13.6834
    Training error : 0
    Cross validation error : 0.02381
With either package, you can make predictions from the trained model using predict() or a similar mechanism.
SVM is very well served by implementation-oriented libraries and packages, and there are many libraries for languages such as C++, Java, and Python. In practice you may well end up implementing it in one of those instead.
Logistic regression
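It is often described as a kind of nonlinear regression (strictly speaking, it is a generalized linear model with a logit link), but because it "regresses onto 0 or 1" it is in effect often treated as a machine-learning method. In fact, I get the impression that it is not uncommon to use it in much the same way as SVM.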
As for how to use it: for each user ID you build a feature vector of categorical data, for example "1 if retained the following month, 0 if not; 1 if played game A, 0 if not; game B ...", and run a logistic regression with "retained the following month or not" as the class label. This lets you work out which game contributed to next-month retention of UUs.
So let's do it in R. The sample data is tjo_uu_behavior.txt, which I used in an earlier post.
    > rawData <- read.delim("tjo_uu_behavior.txt")
    > partData <- rawData[, 2:8]               # exclude the UserID column and the Result label
    > partData <- as.matrix(partData)          # convert to matrix form
    > idx <- which(is.na(partData) == T)       # indices of matrix entries that contain NA
    > partData[idx] <- 0                       # put 0 into every entry that was NA
    > partData <- as.data.frame(partData)      # convert back to a data frame
    > attach(rawData)                          # make the columns of the original data available by name
    > Data <- cbind(partData, Result)          # join the NA-free features (without UserID) with the Result label
    > detach(rawData)                          # detach the original data
    > Data.glm <- glm(Result ~ ., data = Data, family = "binomial")  # logistic regression: specify "binomial"
    > summary(Data.glm)

    Call:
    glm(formula = Result ~ ., family = "binomial", data = Data)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -2.6771  -0.8263  -0.5952   0.2374   2.2293

    Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
    (Intercept)     -0.89916    0.06209 -14.483  < 2e-16 ***
    post.view       -0.74178    0.14061  -5.275 1.33e-07 ***
    post.submit      4.45451    0.51088   8.719  < 2e-16 ***
    photo.submit     2.71624    0.30541   8.894  < 2e-16 ***
    comment.view    -1.49874    0.28597  -5.241 1.60e-07 ***
    comment.submit  16.46523  438.81887   0.038    0.970
    search          16.46523  403.65465   0.041    0.967
    gps.on          -0.09124    0.33068  -0.276    0.783
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 2769.5  on 2200  degrees of freedom
    Residual deviance: 2192.1  on 2193  degrees of freedom
    AIC: 2208.1

    Number of Fisher Scoring iterations: 14

    > exp(Data.glm$coefficients)[-1]  # show how strongly each variable contributes (odds ratios)
         post.view    post.submit   photo.submit   comment.view
      4.762663e-01   8.601370e+01   1.512329e+01   2.234122e-01
    comment.submit         search         gps.on
      1.415002e+07   1.415002e+07   9.127984e-01
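Among the variables with statistically significant coefficients, post.submit turns out to be the action with the highest contribution. (comment.submit and search show enormous odds ratios, but their standard errors are just as enormous and their p-values are around 0.97, so those estimates cannot be trusted.) Incidentally, you can compute per-variable contributions for SVM in a similar way, but in R it seems to take some effort*6.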
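Because tjo_uu_behavior.txt is not bundled with R, the sketch below reproduces the idea on simulated data. Everything about the simulation (the flag probabilities, the assumed true effects, the reuse of the column names post.submit and gps.on) is hypothetical; the point is only to show odds ratios together with Wald confidence intervals, which make unreliable estimates easy to spot.

```r
set.seed(42)
n <- 2000

# Hypothetical binary action flags (1 = the user performed the action)
post.submit <- rbinom(n, 1, 0.2)
gps.on      <- rbinom(n, 1, 0.5)

# Assumed "true" effects, used only to simulate the retention label
p      <- plogis(-1 + 2.0 * post.submit + 0 * gps.on)
Result <- rbinom(n, 1, p)
sim    <- data.frame(post.submit, gps.on, Result)

fit <- glm(Result ~ ., data = sim, family = binomial)

# Odds ratios and Wald 95% confidence intervals on the odds-ratio scale
or <- exp(coef(fit))[-1]
ci <- exp(confint.default(fit))[-1, ]
print(round(cbind(odds.ratio = or, ci), 3))
```

An interval that is narrow and excludes 1 suggests a real effect, while an interval that straddles 1 or spans many orders of magnitude is a warning sign, as with comment.submit and search in the output above.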
Random forests
This is a machine-learning classifier that has spread rapidly in recent years. Its base is actually just an ordinary decision tree / regression tree, but by combining it with bootstrap resampling it obtains classification results quickly and accurately.
The original charm of decision trees, namely "which variables matter?", can be obtained directly (unlike with SVM)*7. So you could say it is a perfect fit for needs like "I don't particularly want to predict the future, but which services should users try so that more of them are retained the following month?"
In R we use the {randomForest} package. The data is again tjo_uu_behavior.txt, as used for logistic regression.
    > library(randomForest)
    > Data.rf <- randomForest(Result ~ ., data = Data)  # same formula syntax as regression
    > Data.rf$importance                                # show variable importance
                   MeanDecreaseGini
    post.view            15.7243759
    post.submit         104.0053984
    photo.submit         44.4120623
    comment.view         12.5603316
    comment.submit        6.6833694
    search                8.5228646
    gps.on                0.2429852
As with logistic regression, the result is that post.submit is the action with the highest contribution.
However, this variable importance does not tell you the direction of the effect (whether a variable drove retention or churn), so it needs to be combined with another method. The computational load is also fairly heavy, and it can happen that logistic regression runs but a random forest does not. Be careful.
Naturally, with a random forest you can also make predictions with predict(), just as with SVM. Depending on the nature of the data, prediction accuracy can differ between SVM and random forest, so you need to compare their performance beforehand.
Association analysis (basket analysis, association rule mining)
This is the so-called "basket analysis". As in the "beer and diapers bought together" example that became famous in the United States, it has traditionally been applied mostly to customer purchase data from retailers, such as POS data.
In web data analysis too, however, it seems to be used more and more to extract behavior patterns where offering services in combination makes users more likely to come back, for example "among users who returned in the month after signing up, which of contents B to Z is viewed most by those who view content A?"
So let's try this in R as well, using the {arules} and {arulesViz} packages. The sample data is the familiar Groceries.
    > library(arules)
    > library(arulesViz)
    > data(Groceries)
    > data.ap <- apriori(Groceries)  # compute association rules with the Apriori algorithm

    parameter specification:
     confidence minval smax arem  aval originalSupport support minlen
            0.8    0.1    1 none FALSE            TRUE     0.1      1
     maxlen target   ext
         10  rules FALSE

    algorithmic control:
     filter tree heap memopt load sort verbose
        0.1 TRUE TRUE  FALSE TRUE    2    TRUE

    apriori - find association rules with the apriori algorithm
    version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
    set item appearances ...[0 item(s)] done [0.00s].
    set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
    sorting and recoding items ... [8 item(s)] done [0.00s].
    creating transaction tree ... done [0.00s].
    checking subsets of size 1 2 done [0.00s].
    writing ... [0 rule(s)] done [0.00s].
    creating S4 object  ... done [0.00s].
    > # the defaults are too strict and yield no rules, so relax the support threshold
    > data.ap <- apriori(Groceries, parameter = list(support = 0.001))

    parameter specification:
     confidence minval smax arem  aval originalSupport support minlen
            0.8    0.1    1 none FALSE            TRUE   0.001      1
     maxlen target   ext
         10  rules FALSE

    algorithmic control:
     filter tree heap memopt load sort verbose
        0.1 TRUE TRUE  FALSE TRUE    2    TRUE

    apriori - find association rules with the apriori algorithm
    version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
    set item appearances ...[0 item(s)] done [0.00s].
    set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
    sorting and recoding items ... [157 item(s)] done [0.00s].
    creating transaction tree ... done [0.00s].
    checking subsets of size 1 2 3 4 5 6 done [0.01s].
    writing ... [410 rule(s)] done [0.00s].
    creating S4 object  ... done [0.00s].
    > summary(data.ap)  # look at the summary
    set of 410 rules

    rule length distribution (lhs + rhs):sizes
      3   4   5   6
     29 229 140  12

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      3.000   4.000   4.000   4.329   5.000   6.000

    summary of quality measures:
        support           confidence          lift
     Min.   :0.001017   Min.   :0.8000   Min.   : 3.131
     1st Qu.:0.001017   1st Qu.:0.8333   1st Qu.: 3.312
     Median :0.001220   Median :0.8462   Median : 3.588
     Mean   :0.001247   Mean   :0.8663   Mean   : 3.951
     3rd Qu.:0.001322   3rd Qu.:0.9091   3rd Qu.: 4.341
     Max.   :0.003152   Max.   :1.0000   Max.   :11.235

    mining info:
          data ntransactions support confidence
     Groceries          9835   0.001        0.8
    > data.ap2 <- subset(data.ap, subset = size(items) < 4)  # too many rules, so keep only combinations of fewer than 4 items
    > summary(data.ap2)
    set of 29 rules

    rule length distribution (lhs + rhs):sizes
     3
    29

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          3       3       3       3       3       3

    summary of quality measures:
        support           confidence          lift
     Min.   :0.001017   Min.   :0.8000   Min.   : 3.131
     1st Qu.:0.001118   1st Qu.:0.8125   1st Qu.: 3.261
     Median :0.001220   Median :0.8462   Median : 3.613
     Mean   :0.001473   Mean   :0.8613   Mean   : 4.000
     3rd Qu.:0.001729   3rd Qu.:0.9091   3rd Qu.: 4.199
     Max.   :0.002542   Max.   :1.0000   Max.   :11.235

    mining info:
          data ntransactions support confidence
     Groceries          9835   0.001        0.8
    > inspect(head(sort(data.ap2, by = "support"), n = 10))  # show the top 10 combinations
       lhs                            rhs                    support confidence      lift
    1  {hamburger meat, curd}      => {whole milk}       0.002541942  0.8064516  3.156169
    2  {herbs, rolls/buns}         => {whole milk}       0.002440264  0.8000000  3.130919
    3  {tropical fruit, herbs}     => {whole milk}       0.002338587  0.8214286  3.214783
    4  {liquor, red/blush wine}    => {bottled beer}     0.001931876  0.9047619 11.235269
    5  {yogurt, rice}              => {other vegetables} 0.001931876  0.8260870  4.269346
    6  {herbs, shopping bags}      => {other vegetables} 0.001931876  0.8260870  4.269346
    7  {pork, butter milk}         => {other vegetables} 0.001830198  0.8571429  4.429848
    8  {yogurt, cereals}           => {whole milk}       0.001728521  0.8095238  3.168192
    9  {meat, margarine}           => {other vegetables} 0.001728521  0.8500000  4.392932
    10 {hamburger meat, bottled beer} => {whole milk}    0.001728521  0.8095238  3.168192
    > # try an interactive graph display
    > plot(data.ap2, method = "graph", control = list(type = "items", arrowSize = 0.1), interactive = T)
    Loading required package: tcltk
    Loading Tcl/Tk interface ... done
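"With these, whole milk and other vegetables" is the strong pattern (lol). Well, that is obvious if you think about it. With retail data there are not many surprising discoveries to begin with. With web data, however, you sometimes get results you would never have imagined, so personally this is one of the methods I strongly recommend for web data analysis.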
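The three columns in the inspect() output (support, confidence, lift) are easy to compute by hand, which helps when reading rules. The sketch below does so in base R on five made-up transactions; the data and the helper `has()` are hypothetical and unrelated to Groceries or to the {arules} internals.

```r
# Hypothetical transactions (invented for illustration)
trans <- list(c("milk", "bread"),
              c("milk", "diapers", "beer"),
              c("bread", "diapers", "beer"),
              c("milk", "bread", "diapers", "beer"),
              c("milk", "bread", "diapers"))

# TRUE for each transaction that contains every item in `items`
has <- function(items) sapply(trans, function(tr) all(items %in% tr))

lhs <- "diapers"
rhs <- "beer"

support    <- mean(has(c(lhs, rhs)))       # share of baskets with both: 3/5 = 0.6
confidence <- support / mean(has(lhs))     # P(rhs | lhs): 0.6 / 0.8 = 0.75
lift       <- confidence / mean(has(rhs))  # relative to P(rhs): 0.75 / 0.6 = 1.25

print(c(support = support, confidence = confidence, lift = lift))
```

A lift above 1 means the right-hand item appears more often together with the left-hand item than it does overall, which is why the {liquor, red/blush wine} => {bottled beer} rule stands out with a lift above 11 in the output above.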
By the way, the way the {arules} package computes association rules is actually quite similar to how recommendation algorithms work. In fact, {recommenderlab}, a package for simulating recommendation, lists {arules} as a dependency. So you could say that doing association analysis amounts to doing recommendation by hand on an ad hoc basis.
Econometric time-series analysis
This is actually a weak spot for the web data analysis industry. As far as I know, there are still hardly any places that actively apply econometric time-series analysis in data science practice.
I touched on this briefly, including the mathematical basics, in an earlier blog post ("On spurious regression"). Modeling time-series data and using it for forecasting is extremely valuable. In this post I will only scratch the surface.
First, univariate time series. In R the {forecast} package is handy.
    > library(forecast)
    > # generate 200 points from an ARIMA(2,1,1) process
    > x.ts <- arima.sim(list(order = c(2, 1, 1), ar = c(0.2, -0.1), ma = 0.1), n = 200)
    > # estimate the ARIMA order of the generated series x.ts
    > x.arima <- auto.arima(x.ts, trace = T, stepwise = T)

     ARIMA(2,1,2) with drift         : 572.951
     ARIMA(0,1,0) with drift         : 596.7827
     ARIMA(1,1,0) with drift         : 574.314
     ARIMA(0,1,1) with drift         : 570.8908
     ARIMA(1,1,1) with drift         : 571.74
     ARIMA(0,1,2) with drift         : 572.8034
     ARIMA(1,1,2) with drift         : 572.3238
     ARIMA(0,1,1)                    : 569.8922
     ARIMA(1,1,1)                    : 570.9663
     ARIMA(0,1,0)                    : 596.6043
     ARIMA(0,1,2)                    : 571.7132
     ARIMA(1,1,2)                    : 571.3888

     Best model: ARIMA(0,1,1)
    > # estimation of the AR and MA orders turns out to be surprisingly ambiguous
    > plot(forecast(x.arima, level = c(50, 95), h = 50))  # forecast the future with forecast()
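To see the same order ambiguity without {forecast}, the sketch below compares a few candidate orders by AIC using base R's arima(). The three candidate orders, the seed, and the use of plain AIC are my own assumptions for illustration; this is not how auto.arima() searches internally.

```r
set.seed(123)

# Simulate from the same ARIMA(2,1,1) specification as above
x.ts <- arima.sim(list(order = c(2, 1, 1), ar = c(0.2, -0.1), ma = 0.1), n = 200)

# A few hand-picked candidate orders (p, d, q)
orders <- list(c(0, 1, 1), c(1, 1, 1), c(2, 1, 1))
fits   <- lapply(orders, function(o) arima(x.ts, order = o, method = "ML"))

aics <- sapply(fits, AIC)
names(aics) <- sapply(orders, paste, collapse = ",")
print(round(aics, 2))

# Forecast 50 steps ahead with the lowest-AIC candidate
best <- fits[[which.min(aics)]]
fc   <- predict(best, n.ahead = 50)   # list with $pred (forecasts) and $se (standard errors)
print(head(fc$pred))
```

When the AIC values of several candidates are within a point or two of each other, as often happens with weak AR and MA coefficients like these, the data simply cannot distinguish the orders well.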
Next, with a VAR model, which is a multivariate time-series model, you can take into account the interactions among several time series that are expected to influence one another, and forecast all of them at the same time. Here I use the {vars} package. The sample data is the bundled Canada dataset.
    > library(vars)
    > data(Canada)
    > VARselect(Canada)  # estimate the VAR model order
    $selection
    AIC(n)  HQ(n)  SC(n) FPE(n)
         3      2      1      3

    $criteria
                      1            2            3            4           5            6            7            8
    AIC(n) -6.191599834 -6.621627919 -6.709002047 -6.512701777 -6.30174681 -6.194596715 -6.011720944 -6.054479536
    HQ(n)  -5.943189052 -6.174488511 -6.063134014 -5.668105118 -5.25842152 -4.952542805 -4.570938409 -4.414968375
    SC(n)  -5.568879538 -5.500731387 -5.089929279 -4.395452772 -3.68632157 -3.080995238 -2.399943231 -1.944525586
    FPE(n)  0.002048239  0.001337721  0.001237985  0.001534875  0.00195439  0.002278812  0.002924622  0.003073249
                      9           10
    AIC(n) -5.912126222 -5.867271844
    HQ(n)  -4.073886435 -3.830303432
    SC(n)  -1.303996035 -0.760965421
    FPE(n)  0.004015164  0.004961704

    > Canada.var <- VAR(Canada, p = 3)                              # fit the VAR model
    > Canada.pred <- predict(Canada.var, n.ahead = 20, ci = 0.95)  # short-term forecast 20 periods ahead
    > plot(Canada.pred)
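We get forecasts for each of the four time series. KPIs in web data analysis are also easily affected by other variables, so I believe that, wherever possible, it is better to use a VAR model or a richer multivariate time-series model.
Beyond these, econometric time-series analysis has many other concepts and methods, such as causality tests, spurious regression, cointegration, GARCH, and Markov switching models, but I will introduce those another time.
Closing remarks
This time I did everything mainly in R, but most of these methods are also implemented in SPSS and similar tools*8. And if you think about embedding them in a back-end system or automating them, there are parts that are better built in Python or the like. Of course, there will also be times when existing packages and libraries are not enough and you are forced to do your own research and development.
So, with this post as an entrance, I would be happy if even one more person steps into the world of web data analysis and data science.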
Bonus 1: preprocessing data into "feature vector + class label" form
Please see my earlier post ("Pull a raw table with Hive, then reshape it into a table of feature vectors + class labels"). Without this step, machine learning is hard to do with any method, especially in R.
Incidentally, even with Hadoop + Hive it is possible to extract data directly in "feature vector + class label" form*9. In that case it is convenient, because you can analyze the exported data simply by loading it straight into R.
Bonus 2: graph theory*10
By the way, association analysis data can be treated as a graph in the graph-theory sense. Not only there: my expectation is that graph theory will play a growing role in web data analysis from now on.
Frankly, I am an amateur here*11, and to be honest I am still teaching myself while using R packages (lol). That said, if you assume, Markov-process style, that you only care about the transition between the immediately preceding status and the current one, you can build a small graph structure even from web data, so I think it has all sorts of applications.
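To make the Markov-style idea concrete, here is a base-R sketch that turns per-user status sequences into a transition matrix, which is also the weighted adjacency matrix of a directed graph. The user logs and status names are made up for illustration.

```r
# Hypothetical per-user status sequences (invented for illustration)
logs <- list(u1 = c("visit", "search", "view", "post"),
             u2 = c("visit", "view", "view", "post"),
             u3 = c("visit", "search", "view"))

states <- sort(unique(unlist(logs)))

# Pair each status with the one that immediately follows it
from <- unlist(lapply(logs, function(s) head(s, -1)))
to   <- unlist(lapply(logs, function(s) tail(s, -1)))

# Transition counts: rows = previous status, columns = current status
counts <- table(factor(from, levels = states),
                factor(to,   levels = states))
print(counts)

# Row-normalize into transition probabilities; rows with no outgoing
# transitions (here "post") would be 0/0, so set them to 0
trans.prob <- prop.table(counts, margin = 1)
trans.prob[is.nan(trans.prob)] <- 0
print(round(trans.prob, 2))
```

A matrix like `counts` could then be handed to a graph package as a weighted adjacency matrix, in the same way the term-term matrix is used in the {igraph} example later in this post.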
If you want to do graph theory in R, the following page is probably the most helpful.
Incidentally, the {igraph} package can compute PageRank, one of the algorithms behind Google's page ranking, and the combination of the {igraph0} and {linkcomm} packages can detect sub-networks that exist inside a network.
Graph analysis with the {igraph} package
There was a very good sample on this page, so I will borrow it. This is the term-document matrix of the tweets of a certain Twitter account.
    > library(igraph)
    > load("termDocMatrix.rdata")
    > # change it to a Boolean matrix
    > termDocMatrix[termDocMatrix>=1] <- 1
    > # transform into a term-term adjacency matrix
    > termMatrix <- termDocMatrix %*% t(termDocMatrix)
    > # inspect terms numbered 5 to 10
    > termMatrix[5:10,5:10]
                  Terms
    Terms          data examples introduction mining network package
      data           53        5            2     34       0       7
      examples        5       17            2      5       2       2
      introduction    2        2           10      2       2       0
      mining         34        5            2     47       1       5
      network         0        2            2      1      17       1
      package         7        2            0      5       1      21
    > # build a graph from the above matrix
    > # (recent igraph releases name this graph_from_adjacency_matrix())
    > g <- graph.adjacency(termMatrix, weighted=T, mode = "undirected")
    > # remove loops
    > g <- simplify(g)
    > # set labels and degrees of vertices
    > V(g)$label <- V(g)$name
    > V(g)$degree <- degree(g)
    > # set seed to make the layout reproducible
    > set.seed(3952)
    > layout1 <- layout.fruchterman.reingold(g)
    > plot(g, layout=layout1)
This is the result drawn with the Fruchterman-Reingold algorithm. "data", "mining", and "r" end up in the center. With this algorithm, vertices that are adjacent tend to be placed close together, so these three words are presumably quite strongly related.
PageRank can be obtained like this. "r" and "data" are strong.
    > page.rank(g)$vector
        analysis applications         code    computing         data     examples
      0.07022298   0.02249946   0.02695463   0.02793215   0.10116822   0.04917196
    introduction       mining      network      package     parallel    positions
      0.02421137   0.09600309   0.04951537   0.04624544   0.02615590   0.02472511
    postdoctoral            r     research       series       slides       social
      0.02990285   0.14125478   0.02646251   0.03275173   0.03787469   0.03888708
            time     tutorial        users
      0.03275173   0.04198358   0.05332538
You can also compute betweenness, centrality, and various other features that characterize the graph as a whole, but I will skip them here.
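For readers curious what page.rank() is computing, the sketch below implements the textbook power iteration in base R on a tiny made-up graph. The adjacency matrix, the function name `pagerank`, and the damping factor of 0.85 are assumptions for illustration; this is not {igraph}'s implementation, and it assumes an unweighted graph with no isolated vertices.

```r
# Small undirected graph as an adjacency matrix (hypothetical example)
A <- matrix(c(0, 1, 1, 1,
              1, 0, 1, 0,
              1, 1, 0, 0,
              1, 0, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(letters[1:4], letters[1:4]))

pagerank <- function(A, damping = 0.85, tol = 1e-10, max.iter = 1000) {
  n <- nrow(A)
  P <- A / rowSums(A)   # row-stochastic transition matrix (every row must have a link)
  r <- rep(1 / n, n)    # start from the uniform distribution
  for (i in seq_len(max.iter)) {
    # with probability `damping` follow a link, otherwise jump to a random vertex
    r.new <- (1 - damping) / n + damping * as.vector(r %*% P)
    converged <- sum(abs(r.new - r)) < tol
    r <- r.new
    if (converged) break
  }
  setNames(r, rownames(A))
}

pr <- pagerank(A)
print(round(pr, 4))
```

Vertex "a" is connected to every other vertex, so it receives the highest score, while "b" and "c" are symmetric and end up tied.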
Detecting sub-networks with the {linkcomm} package
Up to here this has been ordinary graph analysis. From here on it is about a methodology that has been under research and development more recently: detecting lower-level networks or groupings within a graph. As introduced on the page mentioned above, you can actually run this analysis with the {linkcomm} + {igraph0} packages.
For example, using the sample dataset "karate"*12 included in the {linkcomm} package, it looks like this.
    > library(linkcomm)
    > karate.g <- getLinkCommunities(karate, directed = T)
       Checking for loops and duplicate edges... 100.00%
       Calculating edge similarities for 78 edges... 100.00%
       Hierarchical clustering of edges...
       Calculating link densities... 100.00%
       Maximum partition density =  0.1632479
       Finishing up...4/4... 100.00%
       Plotting...
       Colouring dendrogram... 100%
    > karate.ocg <- getOCG.clusters(karate)
    Calculating Initial class System....Done
    Nb. of classes 24
    Nb. of edges not within the classes 13
    Number of initial classes 24
    Running....
    Remaining classes: None
       Reading OCG data...
       Extracting cluster sizes... 100%
    > plot(karate.g)
    > plot(karate.g, type = "graph")
       Getting node community edge density...100%
       Getting node layout...
       Constructing node pies...100%
    > plot(karate.ocg, type = "graph")
       Getting node community edge density...100%
       Getting node layout...
       Constructing node pies...100%
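A dendrogram is also displayed as soon as getLinkCommunities() is run. From that plot you can see how the network is divided into sub-networks. The two plot() calls with type = "graph" that follow then draw the sub-networks.
Qualitatively, we can see that this karate club is apparently split into two large factions. I would think having factions like that in a martial arts organization is rather dangerous (lol). As long as you can extract a similar dataset, I think this is of course a method that can be put to good use in web data analysis as well.
(Note: there were several errors in terminology, which I have corrected. My apologies...)
(Note: following the comments from id:yag_ays, I have revised the description around graph theory. Thank you for pointing this out, and my apologies.)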
*1: The line is broken in places because those points contain NA as missing values.
*2: In R you can try the possible models fairly exhaustively and then pick the best one by AIC or a similar criterion.
*3: For the sake of Michigan's honor, I should add that public safety has improved a great deal in recent years, particularly around Chicago, and downtown Chicago is now safe enough for a woman to walk alone even at night. Just in case.
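*4: Strictly speaking, the method differs depending on whether it uses the Gini coefficient or entropy.
*5: Some packages implement methods that can split into three or more branches.
*6: SPSS seems to obtain it through Monte Carlo simulation or something similar.
*7: For the CART family, which uses the Gini coefficient, variable importance can be expressed as the decrease in Gini.
*8: Note, however, that SPSS is weak on econometric time-series analysis.
*9: That said, I only recently learned how to write the Hive query for it.
*10: Corrected after receiving comments.
*11: I had made mistakes typical of an amateur. My apologies.
*12: As the description says, "A social network of friendships between 34 members of a karate club at a US university in the 1970s (Zachary 1977)", so this is apparently data obtained by analyzing the relationships among students in a karate club at an actual American university.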