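Machine learning practice: building a recommendation system with k-nearest neighbors
Day 10 of working through 「入門 機械学習」: Chapter 10, "k-Nearest Neighbors: Recommendation Systems".
We learn how to use k-nearest neighbors (kNN) to gather highly similar data, then use it to build a system that recommends R packages to R users.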
# Setup
> setwd("10-Recommendations/")
Overview of k-nearest neighbors
kNN classifies data using the distances between data points. It copes well with data like the set below, which is widely scattered and hard to separate with the kind of linear decision boundary used when building the spam filter.
> library('ggplot2')
> df <- read.csv(file.path('data', 'example_data.csv'))
> head(df)
         X        Y Label
1 2.373546 5.398106     0
2 3.183643 4.387974     0
3 2.164371 5.341120     0
4 4.595281 3.870637     0
5 3.329508 6.433024     0
6 2.179532 6.980400     0
> plot <- ggplot(df, aes(x = X, y = Y)) +
    geom_point(aes(shape = Label)) +
    scale_shape_identity()
> ggsave(plot, filename = "01.png")
Using this data, let's predict the Label of a given point with kNN. First, prepare a function that computes the Euclidean distance between every pair of points.
# Build a matrix holding the distance between every pair of points in the data frame
> distance.matrix <- function(df) {
    distance <- matrix(rep(NA, nrow(df) ^ 2), nrow = nrow(df))
    for (i in 1:nrow(df)) {
      for (j in 1:nrow(df)) {
        distance[i, j] <- sqrt((df[i, 'X'] - df[j, 'X']) ^ 2 + (df[i, 'Y'] - df[j, 'Y']) ^ 2)
      }
    }
    return(distance)
  }
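The same pairwise distances can probably be obtained without the double loop by using base R's dist(). A minimal sketch, assuming a made-up three-point data frame `toy` (not the example data) with the same X and Y columns:

```r
# Hypothetical toy data with the same column names as example_data.csv
toy <- data.frame(X = c(0, 3, 0), Y = c(0, 4, 1))

# dist() computes pairwise Euclidean distances by default;
# as.matrix() expands the result into a full symmetric matrix
toy.distance <- as.matrix(dist(toy[, c('X', 'Y')]))

print(toy.distance)
```

For the loop-based function, run time grows with the square of the number of rows, so the built-in version may be worth considering on larger data.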
Next, write a function that returns the k points nearest to a given point.
# Return the k points nearest to the point at position i of distance.
> k.nearest.neighbors <- function(i, distance, k = 5) {
    return(order(distance[i, ])[2:(k + 1)])
  }
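To see why the index range starts at 2, here is a small sketch on a single made-up row of distances (the values in `distance.row` are invented for illustration):

```r
# Hypothetical distances from point 1 to points 1..5
distance.row <- c(0.0, 2.5, 0.7, 1.2, 3.1)

# order() returns indices sorted by increasing distance.
# Position 1 is the point itself (distance 0), so it is skipped.
k <- 2
nearest <- order(distance.row)[2:(k + 1)]

print(nearest)
```

With these values the two nearest neighbors of point 1 are points 3 and 4.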
Finally, define the knn function that makes the predictions.
> knn <- function(df, k = 5) {
    # Compute the distance matrix
    distance <- distance.matrix(df)
    # Create a vector to hold the results
    predictions <- rep(NA, nrow(df))
    for (i in 1:nrow(df)) {
      # Take the k nearest points (5 by default) and guess the label by majority vote
      indices <- k.nearest.neighbors(i, distance, k = k)
      predictions[i] <- ifelse(mean(df[indices, 'Label']) > 0.5, 1, 0)
    }
    return(predictions)
  }
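The majority vote inside the loop is just a mean of 0/1 labels compared with 0.5. A small sketch with invented neighbor labels:

```r
# Hypothetical labels of the 5 nearest neighbors
neighbor.labels <- c(1, 0, 1, 1, 0)

# mean() of 0/1 labels is the share of neighbors labeled 1
vote <- ifelse(mean(neighbor.labels) > 0.5, 1, 0)
print(vote)

# With an even number of neighbors a tie gives exactly 0.5,
# which this rule resolves to 0
tie <- ifelse(mean(c(1, 0, 1, 0)) > 0.5, 1, 0)
print(tie)
```

An odd k avoids ties for two-class data, which may be one reason the default here is 5.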
Everything is ready, so let's run it and evaluate the accuracy.
> df <- transform(df, kNNPredictions = knn(df))
> sum(with(df, Label != kNNPredictions))
[1] 7
There were 7 misclassifications. With 100 data points, that is 93% accuracy.
It turns out R already provides an implementation of kNN (knn in the class package), so let's try that as well.
> rm('knn') # Remove our hand-written knn
> library('class')
> df <- read.csv(file.path('data', 'example_data.csv'))
# Randomly sample half of the data; train on one half and evaluate on the other.
> n <- nrow(df)
> set.seed(1)
> indices <- sort(sample(1:n, n * (1 / 2)))
> training.x <- df[indices, 1:2]
> test.x <- df[-indices, 1:2]
> training.y <- df[indices, 3]
> test.y <- df[-indices, 3]
# Run the prediction
> predicted.y <- knn(training.x, test.x, training.y, k = 5)
> sum(predicted.y != test.y)
[1] 7
7 misclassifications. With 50 test points, that is 86% accuracy.
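The accuracy figures quoted so far follow directly from the error counts. A small sketch, where `accuracy` is a helper name introduced here only for illustration:

```r
# Hypothetical helper: share of correctly classified points
accuracy <- function(errors, n) {
  1 - errors / n
}

# Hand-written knn: 7 errors out of 100 points
print(accuracy(7, 100))

# class::knn on the held-out half: 7 errors out of 50 points
print(accuracy(7, 50))
```

Note that the two numbers are not measured on the same points: the first run scores all 100 points against their own neighbors, while the second scores only the 50 held-out points.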
For comparison, let's also try predicting with logistic regression.
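# Note: without family = binomial, glm() fits a Gaussian (linear) model;
# the count below comes from the call exactly as written.
> logit.model <- glm(Label ~ X + Y, data = df[indices, ])
> predictions <- as.numeric(predict(logit.model, newdata = df[-indices, ]) > 0)
> sum(predictions != test.y)
[1] 16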
This one misclassifies 16 points, for 68% accuracy. For data like this, which is hard to separate linearly, kNN works better than a linear classifier.
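For reference, a logistic fit in R is normally requested with family = binomial, which the glm() call above omits. The sketch below shows that form on synthetic data generated here for illustration (not example_data.csv), so its error count is not comparable with the figures above:

```r
set.seed(1)
n <- 200

# Hypothetical synthetic data: the label depends linearly on X and Y plus noise
synthetic <- data.frame(X = rnorm(n), Y = rnorm(n))
synthetic$Label <- as.numeric(synthetic$X + synthetic$Y + rnorm(n, sd = 0.5) > 0)

train <- synthetic[1:100, ]
test <- synthetic[101:200, ]

# family = binomial makes glm() fit a logistic regression
logit.model <- glm(Label ~ X + Y, data = train, family = binomial(link = 'logit'))

# type = 'response' returns probabilities, which are thresholded at 0.5
probabilities <- predict(logit.model, newdata = test, type = 'response')
predictions <- as.numeric(probabilities > 0.5)

errors <- sum(predictions != test$Label)
print(errors)
```

Because this synthetic data is close to linearly separable, a linear classifier should do well on it; the example data in this chapter is the opposite case.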
Building a recommendation system for R packages
Now that we have the general idea, let's use kNN to build a system that recommends R packages to R users.
Specifically, we do item-based recommendation: starting from the packages a user has installed, we look for similar packages and recommend those.
First, load the package data.
> installations <- read.csv(file.path('data', 'installations.csv'))
> head(installations)
             Package User Installed
1              abind    1         1
2 AcceptanceSampling    1         0
3             ACCLMA    1         0
4           accuracy    1         1
5            acepack    1         0
6        aCGH.Spline    1         0
Rows where Installed is 1 are packages the user has installed.
First, convert this data into a user x package matrix (one row per user, one column per package).
> library('reshape')
> user.package.matrix <- cast(installations, User ~ Package, value = 'Installed')
> user.package.matrix[, 1]
> row.names(user.package.matrix) <- user.package.matrix[, 1]
> user.package.matrix <- user.package.matrix[, -1]
> head(user.package.matrix[1:6, 1:6])
  abind AcceptanceSampling ACCLMA accuracy acepack aCGH.Spline
1     1                  0      0        1       0           0
3     1                  1      0        1       1           1
4     0                  1      1        1       1           0
5     1                  1      1        0       1           0
6     1                  1      1        0       1           0
7     1                  1      1        0       1           1
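The long-to-wide conversion can also be sketched with base R alone. This is an alternative to cast(), not the approach used above, shown on a made-up four-row data frame shaped like installations.csv:

```r
# Hypothetical long-format data with the same columns as installations.csv
toy <- data.frame(
  Package = c('abind', 'accuracy', 'abind', 'accuracy'),
  User = c(1, 1, 2, 2),
  Installed = c(1, 0, 1, 1)
)

# xtabs() sums Installed for each (User, Package) pair,
# giving one row per user and one column per package
toy.matrix <- unclass(xtabs(Installed ~ User + Package, data = toy))

print(toy.matrix)
```

Each cell holds 1 if that user has that package installed and 0 otherwise, which is the same layout as user.package.matrix.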
Passing this matrix to cor computes the correlation coefficient between each pair of columns.
Here we use that as the measure of similarity between packages.
> similarities <- cor(user.package.matrix)
> similarities[1:6, 1:6]
            [,1]        [,2]        [,3]         [,4]         [,5]         [,6]
[1,]  1.00000000 -0.04822428 -0.19485928 -0.074935401  0.111369209  0.114616917
[2,] -0.04822428  1.00000000  0.32940392 -0.152710227 -0.151554446 -0.129640745
[3,] -0.19485928  0.32940392  1.00000000 -0.194859284  0.129323382  0.068326672
[4,] -0.07493540 -0.15271023 -0.19485928  1.000000000 -0.129930744  0.006251832
[5,]  0.11136921 -0.15155445  0.12932338 -0.129930744  1.000000000  0.007484812
[6,]  0.11461692 -0.12964074  0.06832667  0.006251832  0.007484812  1.000000000
Correlation coefficients range from 1 to -1, so we transform them so that 1 becomes distance zero (nearest) and -1 becomes infinite distance (farthest).
> distances <- -log((similarities / 2) + 0.5)
> distances[1:6, 1:6]
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
[1,] 0.0000000 0.7425730 0.9098854 0.7710389 0.5875544 0.5846364
[2,] 0.7425730 0.0000000 0.4084165 0.8588597 0.8574965 0.8319964
[3,] 0.9098854 0.4084165 0.0000000 0.9098854 0.5715285 0.6270536
[4,] 0.7710389 0.8588597 0.9098854 0.0000000 0.8323296 0.6869148
[5,] 0.5875544 0.8574965 0.5715285 0.8323296 0.0000000 0.6856902
[6,] 0.5846364 0.8319964 0.6270536 0.6869148 0.6856902 0.0000000
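To check that the transform behaves as described at the extremes, here is a small sketch applying the same formula to a few hand-picked similarity values:

```r
# Hand-picked correlation values covering the full range
similarity <- c(1, 0.5, 0, -0.5, -1)

# Same transform as above: rescale [-1, 1] to [0, 1], then take -log
distance <- -log((similarity / 2) + 0.5)

print(round(distance, 4))
```

A correlation of 1 maps to 0, a correlation of 0 maps to log(2), and a correlation of -1 maps to Inf, so more similar packages always end up closer.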
With the data in place, we write functions that pick out a package's k nearest packages and compute the probability that the package is installed.
# Return the k points nearest to the point at position i of distances.
> k.nearest.neighbors <- function(i, distances, k = 25) {
    return(order(distances[i, ])[2:(k + 1)])
  }
# Compute the probability that the given user has the given package installed
> installation.probability <- function(user, package, user.package.matrix, distances, k = 25) {
    # Get the neighboring packages
    neighbors <- k.nearest.neighbors(package, distances, k = k)
    # Return the mean installation rate of the neighboring packages as this package's installation rate.
    return(mean(sapply(neighbors, function (neighbor) {user.package.matrix[user, neighbor]})))
  }
A quick check: the probability that user 1 installs package 1 is 76%.
> installation.probability(1, 1, user.package.matrix, distances)
[1] 0.76
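The calculation is easier to follow on a tiny case. A self-contained sketch with invented data, where `toy.matrix`, `toy.distances`, `neighbors` and `probability` are all names made up for this illustration:

```r
# Hypothetical data: 2 users x 4 packages (1 = installed)
toy.matrix <- matrix(c(1, 1, 0, 0,
                       0, 1, 1, 1),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c('u1', 'u2'), c('a', 'b', 'c', 'd')))

# Hypothetical symmetric package-to-package distances
toy.distances <- matrix(c(0.0, 0.2, 0.9, 0.4,
                          0.2, 0.0, 0.5, 0.8,
                          0.9, 0.5, 0.0, 0.3,
                          0.4, 0.8, 0.3, 0.0),
                        nrow = 4, byrow = TRUE)

# Indices of the k packages closest to package i (position 1 is i itself)
neighbors <- function(i, distances, k) {
  order(distances[i, ])[2:(k + 1)]
}

# Share of the k nearest packages that the user has installed
probability <- function(user, package, m, distances, k) {
  mean(m[user, neighbors(package, distances, k)])
}

# Package 'a' has nearest neighbors 'b' and 'd'
print(probability(1, 1, toy.matrix, toy.distances, k = 2))
print(probability(2, 1, toy.matrix, toy.distances, k = 2))
```

User u1 has one of the two neighboring packages, so the estimate is 0.5; user u2 has both, so the estimate is 1. The user's own entry for the package being scored is not used.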
All that remains is to compute the installation probability for every package and take the ones with the highest probability, and the recommendation engine is done.
# Compute the installation probability for every package and return them sorted from highest to lowest.
> most.probable.packages <- function(user, user.package.matrix, distances, k = 25) {
    return(order(sapply(1:ncol(user.package.matrix), function (package) {
      installation.probability(user, package, user.package.matrix, distances, k = k)
    }), decreasing = TRUE))
  }
Let's run it.
# Get the list of packages sorted by installation probability
> listing = most.probable.packages(1, user.package.matrix, distances)
# Show the names of the top 5 packages
> colnames(user.package.matrix)[listing[1:5]]
[1] "adegenet"            "AIGIS"               "ConvergenceConcepts"
[4] "corcounts"           "DBI"
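The last step, turning sorted indices back into package names, can be sketched on its own. The probabilities below are invented for illustration and are not output from the data above:

```r
# Hypothetical per-package probabilities for one user
probabilities <- c(abind = 0.2, accuracy = 0.8, DBI = 0.6, zoo = 0.4)

# order(..., decreasing = TRUE) gives positions from most to least probable
listing <- order(probabilities, decreasing = TRUE)

# Map the top 2 positions back to package names
top <- names(probabilities)[listing[1:2]]
print(top)
```

Since order() returns positions rather than values, the names have to be looked up afterwards, which is what the colnames() call above does.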