Machine Learning Practice: Ranking Email by Importance
Day 4 of working through "Machine Learning for Hackers" (入門 機械学習): Chapter 4, "Ranking: Priority Inbox".
We build a system that ranks email by importance.
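Ranking approach
We assign a priority to each email using the following features.
- 1) Number of messages from the sender
  - Email from senders we correspond with frequently is treated as important
- 2) Thread activity
  - Emails in actively discussed threads should get a higher priority
  - Activity is computed as the number of emails in the thread per second; the higher the value, the more active the thread
- 3) Words contained in the subject and body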
Loading the required modules and data
> setwd("04-Ranking/")
> library('tm')
> library('ggplot2')
> library('plyr')
> library('reshape')
# For the email data, use the non-spam emails (easy ham) from Chapter 3
> data.path <- file.path("..", "03-Classification", "data")
> easyham.path <- file.path(data.path, "easy_ham")
Extracting features from the emails
We define functions that read an email file and return the parts we need (sender, subject, body, and date).
# Read the whole email file and return it as a vector of lines
> msg.full <- function(path) {
    con <- file(path, open = "rt", encoding = "latin1")
    msg <- readLines(con)
    close(con)
    return(msg)
  }
# Extract the From address from the email data
> get.from <- function(msg.vec) {
    from <- msg.vec[grepl("From: ", msg.vec)]
    # strsplit returns a list, so take its first element with [[1]]
    from <- strsplit(from, '[":<> ]')[[1]]
    from <- from[which(from != "" & from != " ")]
    return(from[grepl("@", from)][1])
  }
# Extract the subject from the email data
> get.subject <- function(msg.vec) {
    subj <- msg.vec[grepl("Subject: ", msg.vec)]
    if(length(subj) > 0) {
      return(strsplit(subj, "Subject: ")[[1]][2])
    } else {
      return("")
    }
  }
# Extract the body from the email data (everything after the first blank line)
> get.msg <- function(msg.vec) {
    msg <- msg.vec[seq(which(msg.vec == "")[1] + 1, length(msg.vec), 1)]
    return(paste(msg, collapse = "\n"))
  }
# Extract the received date and time from the email data
> get.date <- function(msg.vec) {
    date.grep <- grepl("^Date: ", msg.vec)
    date.grep <- which(date.grep == TRUE)
    date <- msg.vec[date.grep[1]]
    date <- strsplit(date, "\\+|\\-|: ")[[1]][2]
    date <- gsub("^\\s+|\\s+$", "", date)
    return(strtrim(date, 25))
  }
# Read an email and return the features we need
> parse.email <- function(path) {
    full.msg <- msg.full(path)
    date <- get.date(full.msg)
    from <- get.from(full.msg)
    subj <- get.subject(full.msg)
    msg <- get.msg(full.msg)
    return(c(date, from, subj, msg, path))
  }
A quick test.
> parse.email("../03-Classification/data/easy_ham/00111.a478af0547f2fd548f7b412df2e71a92")
[1] "Mon, 7 Oct 2002 10:37:26"
[2] "[email protected]"
...
Reading the email data and collecting the features into a data frame
# Parse all emails
> easyham.docs <- dir(easyham.path)
> easyham.docs <- easyham.docs[which(easyham.docs != "cmds")]
> easyham.parse <- lapply(easyham.docs, function(p) parse.email(file.path(easyham.path, p)))
# Convert to a data frame
> ehparse.matrix <- do.call(rbind, easyham.parse)
> allparse.df <- data.frame(ehparse.matrix, stringsAsFactors = FALSE)
> names(allparse.df) <- c("Date", "From.EMail", "Subject", "Message", "Path")
Done.
> head(allparse.df)
                       Date                   From.EMail
1 Thu, 22 Aug 2002 18:26:25            kre@munnari.OZ.AU
2 Thu, 22 Aug 2002 12:46:18 steve.burt@cursor-system.com
3 Thu, 22 Aug 2002 13:52:38                timc@2ubh.com
4 Thu, 22 Aug 2002 09:15:25             monty@roscom.com
Adjusting the data
The send date is still a character string, so we convert it to a POSIX object.
# In a Japanese locale, %b does not match month abbreviations such as "Aug", so switch the locale first
> Sys.setlocale(locale="C")
# Try pattern1 first and fall back to pattern2 for the entries it could not parse
> date.converter <- function(dates, pattern1, pattern2) {
    pattern1.convert <- strptime(dates, pattern1)
    pattern2.convert <- strptime(dates, pattern2)
    pattern1.convert[is.na(pattern1.convert)] <- pattern2.convert[is.na(pattern1.convert)]
    return(pattern1.convert)
  }
> pattern1 <- "%a, %d %b %Y %H:%M:%S"
> pattern2 <- "%d %b %Y %H:%M:%S"
> allparse.df$Date <- date.converter(allparse.df$Date, pattern1, pattern2)
> head(allparse.df)
                 Date                   From.EMail
1 2002-08-22 18:26:25            kre@munnari.OZ.AU
2 2002-08-22 12:46:18 steve.burt@cursor-system.com
3 2002-08-22 13:52:38                timc@2ubh.com
4 2002-08-22 09:15:25             monty@roscom.com
5 2002-08-22 14:38:22    Stewart.Smith@ee.ed.ac.uk
# Restore the locale
> Sys.setlocale(locale="ja_JP.UTF-8")
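The two-pattern fallback can be tried in isolation. The following is a minimal sketch, not the code from the book: it assumes that setting only `LC_TIME` to "C" is enough for `%a` and `%b` to match English names, it converts to `POSIXct` for simpler sub-assignment, and it fixes the time zone to UTC for reproducibility. The two date strings are the two header styles that appear in this post's output.

```r
# Assumption: LC_TIME = "C" makes %a / %b match English day and month names
invisible(Sys.setlocale("LC_TIME", "C"))

pattern1 <- "%a, %d %b %Y %H:%M:%S"  # e.g. "Thu, 22 Aug 2002 18:26:25"
pattern2 <- "%d %b %Y %H:%M:%S"      # e.g. "01 Feb 2002 00:53:41"

dates <- c("Thu, 22 Aug 2002 18:26:25", "01 Feb 2002 00:53:41")

# Parse with each pattern; entries that do not fit a pattern become NA
first.try  <- as.POSIXct(strptime(dates, pattern1, tz = "UTC"))
second.try <- as.POSIXct(strptime(dates, pattern2, tz = "UTC"))

# Keep the first pattern's result and fill only its failures from the second
parsed <- first.try
missing <- is.na(parsed)
parsed[missing] <- second.try[missing]

print(format(parsed, "%Y-%m-%d %H:%M:%S"))
```

Each string fails under one pattern and succeeds under the other, so after the fill step no entry should be left as NA.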
Next, we convert the subject and the sender address to lowercase.
> allparse.df$Subject <- tolower(allparse.df$Subject)
> allparse.df$From.EMail <- tolower(allparse.df$From.EMail)
Finally, sort by send date.
> priority.df <- allparse.df[with(allparse.df, order(Date)), ]
The first half of the data will be used as training data, so we store it in a separate variable.
> priority.train <- priority.df[1:(round(nrow(priority.df) / 2)), ]
Weighting by number of emails per sender
To weight by the number of emails per sender, we first check what the counts look like.
Count the emails per sender.
> from.weight <- melt(with(priority.train, table(From.EMail)))
> from.weight <- from.weight[with(from.weight, order(value)), ]
> head(from.weight)
                    From.EMail value
1            adam@homeport.org     1
2     admin@networksonline.com     1
4 albert.white@ireland.sun.com     1
5                andr@sandy.ru     1
6             andris@aernet.ru     1
9              antoin@eire.com     1
> summary(from.weight$value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.00    2.00    4.63    4.00   55.00
The mean is 4.63 emails and the maximum is 55, so the spread looks fairly large. Let's plot the addresses that sent 7 or more emails.
> from.ex <- subset(from.weight, value >= 7)
> from.scales <- ggplot(from.ex) +
    geom_rect(aes(xmin = 1:nrow(from.ex) - 0.5,
                  xmax = 1:nrow(from.ex) + 0.5,
                  ymin = 0,
                  ymax = value,
                  fill = "lightgrey",
                  color = "darkblue")) +
    scale_x_continuous(breaks = 1:nrow(from.ex), labels = from.ex$From.EMail) +
    coord_flip() +
    scale_fill_manual(values = c("lightgrey" = "lightgrey"), guide = "none") +
    scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +
    ylab("Number of Emails Received (truncated at 6)") +
    xlab("Sender Address") +
    theme_bw() +
    theme(axis.text.y = element_text(size = 5, hjust = 1))
> ggsave(plot = from.scales,
         filename = file.path("images", "0011_from_scales.png"),
         height = 4.8,
         width = 7)
A few senders sent more than 10 times as many emails as the average sender. If we used the raw counts as weights, these unusual senders would get far too high a priority. The graph suggests the counts grow roughly exponentially, so we rescale the weights with the natural logarithm.
# Add 1 to the value so that the weight does not become zero after taking the log
> from.weight <- transform(from.weight,
                           Weight = log(value + 1),
                           log10Weight = log10(value + 1))
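To see how much the log rescaling compresses the spread, here is a small sketch using four counts taken from the summary above (minimum, median, third quartile, and maximum). The variable names are illustrative and not part of the original code.

```r
# Counts from the summary: min, median, 3rd quartile, max
counts <- c(1, 2, 4, 55)

# Same transformation as the Weight column: natural log of (count + 1)
weights <- log(counts + 1)

# Ratio between the most and least active sender, before and after
ratio.raw <- max(counts) / min(counts)    # 55
ratio.log <- max(weights) / min(weights)  # about 5.8

print(round(weights, 4))
print(c(raw = ratio.raw, log = round(ratio.log, 2)))
```

On the raw scale the busiest sender outweighs a one-time sender 55 to 1; after the transformation the gap is under 6 to 1, while the ordering of senders is unchanged.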
Weighting by thread activity
First, we count the emails per thread.
Whether an email belongs to a thread is decided by looking at its subject.
# Extract the subject with "re: " removed (= the thread name) and the sender.
> find.threads <- function(email.df) {
    response.threads <- strsplit(email.df$Subject, "re: ")
    # A subject that starts with "re: " splits into an empty first element
    is.thread <- sapply(response.threads,
                        function(subj) ifelse(subj[1] == "", TRUE, FALSE))
    threads <- response.threads[is.thread]
    senders <- email.df$From.EMail[is.thread]
    threads <- sapply(threads,
                      function(t) paste(t[2:length(t)], collapse = "re: "))
    return(cbind(senders,threads))
  }
> threads.matrix <- find.threads(priority.train)
> head(threads.matrix)
     senders            threads
[1,] "[email protected]" "new sequences window"
[2,] "[email protected]" "[zzzzteana] nothing like mama used to make"
[3,] "[email protected]" "[zzzzteana] nothing like mama used to make"
[4,] "[email protected]" "[zzzzteana] nothing like mama used to make"
[5,] "[email protected]" "[sadev] live rule updates after release ???"
[6,] "[email protected]" "new sequences window"
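The thread test relies on how `strsplit` treats a leading delimiter: a subject that starts with "re: " yields an empty string as its first piece. The sketch below shows this on three made-up, already-lowercased subjects; only base R is used.

```r
# Hypothetical subjects (already lowercased, as in the post)
subjects <- c("re: new sequences window",
              "new sequences window",
              "re: re: apt update")

parts <- strsplit(subjects, "re: ")

# A reply starts with "re: ", so its first piece is the empty string
is.reply <- sapply(parts, function(p) p[1] == "")

# Rebuild the thread name from the remaining pieces, as find.threads does
thread.names <- sapply(parts[is.reply],
                       function(p) paste(p[2:length(p)], collapse = "re: "))

print(is.reply)
print(thread.names)
```

Note that only the first "re: " is stripped, so a doubly prefixed subject keeps one "re: " in its thread name and is counted as a separate thread from the singly prefixed one.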
Next, we compute the activity of each thread.
# Return the list of activity values for every thread
> get.threads <- function(threads.matrix, email.df) {
    threads <- unique(threads.matrix[, 2])
    thread.counts <- lapply(threads, function(t) thread.counts(t, email.df))
    thread.matrix <- do.call(rbind, thread.counts)
    return(cbind(threads, thread.matrix))
  }
# Return the activity of the emails that belong to the given thread name
> thread.counts <- function(thread, email.df) {
    # Take the send dates of the emails that belong to the thread
    thread.times <- email.df$Date[which(email.df$Subject == thread |
                                        email.df$Subject == paste("re:", thread))]
    freq <- length(thread.times)      # total number of emails in the thread
    min.time <- min(thread.times)     # earliest send date
    max.time <- max(thread.times)     # latest send date
    time.span <- as.numeric(difftime(max.time, min.time, units = "secs"))
    if(freq < 2) {
      # Only one email (no reply, so not really a thread): return NA
      return(c(NA, NA, NA))
    } else {
      trans.weight <- freq / time.span  # emails sent per second
      # Add 10 so that the log does not come out negative (an affine transformation)
      log.trans.weight <- 10 + log(trans.weight, base = 10)
      return(c(freq, time.span, log.trans.weight))
    }
  }
# Build the weights for the training data (this call is needed before the conversion below)
> thread.weights <- get.threads(threads.matrix, priority.train)
> thread.weights <- data.frame(thread.weights, stringsAsFactors = FALSE)
> names(thread.weights) <- c("Thread", "Freq", "Response", "Weight")
> thread.weights$Freq <- as.numeric(thread.weights$Freq)
> thread.weights$Response <- as.numeric(thread.weights$Response)
> thread.weights$Weight <- as.numeric(thread.weights$Weight)
> thread.weights <- subset(thread.weights, is.na(thread.weights$Freq) == FALSE)
> head(thread.weights)
                                      Thread Freq Response   Weight
1   please help a newbie compile mplayer :-)    4    42309 5.975627
2                 prob. w/ install/uninstall    4    23745 6.226488
3                       http://apt.nixia.no/   10   265303 5.576258
4         problems with 'apt-get -f install'    3    55960 5.729244
5                   problems with apt update    2     6347 6.498461
6 about apt, kernel updates and dist-upgrade    5   240238 5.318328
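As a sanity check on the formula, the weight column can be recomputed by hand from the Freq and Response columns of the first two rows shown above. This is only an arithmetic sketch; the variable names are illustrative.

```r
# Freq and Response (time span in seconds) from the first two rows above
freq      <- c(4, 4)
time.span <- c(42309, 23745)

# Emails per second, then log10 shifted by 10 to keep the value positive
trans.weight <- freq / time.span
log.trans.weight <- 10 + log(trans.weight, base = 10)

print(round(log.trans.weight, 6))
```

The results agree with the Weight column in the output (about 5.9756 and 6.2265). Both threads have four emails, but the second covers a shorter time span and therefore gets the higher weight.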
In addition, to complement the sender weighting, we compute a weight that shows how many threads each sender takes part in.
> email.thread <- function(threads.matrix) {
    senders <- threads.matrix[, 1]
    senders.freq <- table(senders)
    senders.matrix <- cbind(names(senders.freq),
                            senders.freq,
                            log(senders.freq + 1))
    senders.df <- data.frame(senders.matrix, stringsAsFactors=FALSE)
    row.names(senders.df) <- 1:nrow(senders.df)
    names(senders.df) <- c("From.EMail", "Freq", "Weight")
    senders.df$Freq <- as.numeric(senders.df$Freq)
    senders.df$Weight <- as.numeric(senders.df$Weight)
    return(senders.df)
  }
> senders.df <- email.thread(threads.matrix)
> head(senders.df)
                    From.EMail Freq    Weight
1            adam@homeport.org    1 0.6931472
2        aeriksson@fastmail.fm    5 1.7917595
3 albert.white@ireland.sun.com    1 0.6931472
4          alex@netwindows.org    1 0.6931472
5                andr@sandy.ru    1 0.6931472
6             andris@aernet.ru    1 0.6931472
Weighting by words contained in the subject and body
First, the subject.
- Extract the list of words contained in the thread names and compute a weight for each word.
- For each word, take the weight of every thread that contains it and use the mean as the word's weight.
# Return the list of terms and their frequencies
> term.counts <- function(term.vec, control) {
    vec.corpus <- Corpus(VectorSource(term.vec))
    vec.tdm <- TermDocumentMatrix(vec.corpus, control = control)
    return(rowSums(as.matrix(vec.tdm)))
  }
# Extract the list of words contained in the thread names
> thread.terms <- term.counts(thread.weights$Thread,
                              control = list(stopwords = TRUE))
> thread.terms <- names(thread.terms) # the frequencies are not used, so drop them
> head(thread.terms)
[1] "--with"    ":-)"       "..."       ".doc"      "'apt-get"  "\"holiday"
# Compute a weight for each word:
# take the weight of every thread that contains the word and use the mean
> term.weights <- sapply(thread.terms,
                         function(t) mean(thread.weights$Weight[grepl(t, thread.weights$Thread, fixed = TRUE)]))
> head(term.weights)
  --with      :-)      ...     .doc 'apt-get "holiday
7.109579 6.103883 6.050786 5.725911 5.729244 7.197911
# Tidy up into a data frame
> term.weights <- data.frame(list(Term = names(term.weights),
                                  Weight = term.weights),
                             stringsAsFactors = FALSE,
                             row.names = 1:length(term.weights))
> head(term.weights)
      Term   Weight
1   --with 7.109579
2      :-) 6.103883
3      ... 6.050786
4     .doc 5.725911
5 'apt-get 5.729244
6 "holiday 7.197911
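The averaging step can be illustrated without the tm package. The sketch below uses a made-up three-row data frame and a hypothetical helper `term.weight`; it mirrors the `grepl(..., fixed = TRUE)` lookup above but is not the original code.

```r
# Hypothetical thread weights (values invented for illustration)
threads <- data.frame(
  Thread = c("problems with apt update", "apt kernel updates", "linux install"),
  Weight = c(6.5, 5.3, 4.0),
  stringsAsFactors = FALSE
)

# Mean weight of all threads whose name contains the term (literal match)
term.weight <- function(term) {
  hits <- grepl(term, threads$Thread, fixed = TRUE)
  mean(threads$Weight[hits])
}

apt.weight     <- term.weight("apt")      # averages the two "apt" threads
install.weight <- term.weight("install")  # only one thread matches

print(c(apt = apt.weight, install = install.weight))
```

Because the match is a plain substring search, a term also matches inside longer words (for example "apt" inside "adapter"), which is worth keeping in mind when reading the resulting weights.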
Next, the body.
# Count the words contained in the bodies and their frequencies
> msg.terms <- term.counts(priority.train$Message,
                           control = list(stopwords = TRUE,
                                          removePunctuation = TRUE,
                                          removeNumbers = TRUE))
# Compute the weights; here too we take the log
> msg.weights <- data.frame(list(Term = names(msg.terms),
                                 Weight = log(msg.terms, base = 10)),
                            stringsAsFactors = FALSE,
                            row.names = 1:length(msg.terms))
# Drop terms whose weight is zero
> msg.weights <- subset(msg.weights, Weight > 0)
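The zero-weight filter has a simple consequence: a term that appears exactly once has log10(1) = 0 and is removed. A minimal sketch with invented counts, assuming a named numeric vector like the one `term.counts` returns:

```r
# Hypothetical term counts (a named numeric vector)
msg.terms <- c(linux = 100, kernel = 10, typo = 1)

msg.weights <- data.frame(Term = names(msg.terms),
                          Weight = log(msg.terms, base = 10),
                          stringsAsFactors = FALSE,
                          row.names = NULL)

# Terms seen only once have weight log10(1) = 0 and are dropped here
msg.weights <- subset(msg.weights, Weight > 0)

print(msg.weights)
```

Only "linux" and "kernel" remain, so words seen once in the training bodies contribute nothing when a new message is ranked.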
All of the weight data frames are now ready.
Ranking the emails
We define the functions that compute the importance.
# Return the weight for the given terms.
# Takes the search terms, the weight data frame to search, and a flag saying
# whether the target is term.weight, and returns the weight.
> get.weights <- function(search.term, weight.df, term = TRUE) {
    if(length(search.term) > 0) {
      # The column name differs depending on whether weight.df is term.weight, so handle that here
      if(term) {
        term.match <- match(names(search.term), weight.df$Term)
      } else {
        term.match <- match(search.term, weight.df$Thread)
      }
      match.weights <- weight.df$Weight[which(!is.na(term.match))]
      if(length(match.weights) < 1) {
        # If nothing matched, use 1
        return(1)
      } else {
        # If one or more matched, use the mean
        return(mean(match.weights))
      }
    } else {
      return(1)
    }
  }
# Return the importance of an email
> rank.message <- function(path) {
    # Parse the email
    msg <- parse.email(path)
    # Weight based on the number of emails the sender has sent
    from <- ifelse(length(which(from.weight$From.EMail == msg[2])) > 0,
                   from.weight$Weight[which(from.weight$From.EMail == msg[2])],
                   1)
    # Weight based on the number of threads the sender has taken part in
    thread.from <- ifelse(length(which(senders.df$From.EMail == msg[2])) > 0,
                          senders.df$Weight[which(senders.df$From.EMail == msg[2])],
                          1)
    # Decide whether the email is a post to a thread; if so, get the thread's weight
    subj <- strsplit(tolower(msg[3]), "re: ")
    is.thread <- ifelse(subj[[1]][1] == "", TRUE, FALSE)
    if(is.thread){
      activity <- get.weights(subj[[1]][2], thread.weights, term = FALSE)
    } else {
      # Not a post to a thread: the weight is 1
      activity <- 1
    }
    # Weight based on the email subject
    thread.terms <- term.counts(msg[3], control = list(stopwords = TRUE))
    thread.terms.weights <- get.weights(thread.terms, term.weights)
    # Weight based on the email body
    msg.terms <- term.counts(msg[4],
                             control = list(stopwords = TRUE,
                                            removePunctuation = TRUE,
                                            removeNumbers = TRUE))
    msg.weights <- get.weights(msg.terms, msg.weights)
    # Multiply all the weights together to get the importance
    rank <- prod(from,
                 thread.from,
                 activity,
                 thread.terms.weights,
                 msg.weights)
    return(c(msg[1], msg[2], msg[3], rank))
  }
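Because the final score is a product, a feature with no training data (weight 1) leaves the score unchanged, while any weight below 1 lowers it. The sketch below illustrates this with a hypothetical lookup helper `lookup.weight` and invented addresses; it is a simplified stand-in for the `ifelse(...)` lookups in `rank.message`, not the original code.

```r
# Hypothetical sender weights, computed as log(count + 1)
from.weights <- c("a@example.com" = log(5 + 1),   # 5 emails in training
                  "b@example.com" = log(1 + 1))   # 1 email in training

# Return the stored weight, or the neutral value 1 when the key is unknown
lookup.weight <- function(key, table, default = 1) {
  if (key %in% names(table)) table[[key]] else default
}

frequent <- lookup.weight("a@example.com", from.weights)
one.off  <- lookup.weight("b@example.com", from.weights)
unknown  <- lookup.weight("z@example.com", from.weights)

# Combine with some other (invented) weights by multiplication
other.weights <- c(activity = 1, subject.terms = 5.9, body.terms = 1.2)
ranks <- c(frequent = prod(frequent, other.weights),
           one.off  = prod(one.off, other.weights),
           unknown  = prod(unknown, other.weights))

print(round(ranks, 3))
```

One side effect visible here: a sender seen exactly once gets log(2), which is about 0.69, so under this scheme they rank slightly below a completely unknown sender, whose weight defaults to 1.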
A quick test.
> rank.message("../03-Classification/data/easy_ham/00111.a478af0547f2fd548f7b412df2e71a92")
[1] "Mon, 7 Oct 2002 10:37:26"
[2] "[email protected]"
[3] "Re: [ILUG] Interesting article on free software licences"
[4] "5.27542087468428"
Checking whether the threshold for priority email is reasonable
This time we use the median priority as the threshold. We check whether the threshold is reasonable using the first half of the data.
> train.paths <- priority.df$Path[1:(round(nrow(priority.df) / 2))]
> test.paths <- priority.df$Path[((round(nrow(priority.df) / 2)) + 1):nrow(priority.df)]
# Compute the importance of the emails in train.paths
> train.ranks <- suppressWarnings(lapply(train.paths, rank.message))
# Convert to a data frame
> train.ranks.matrix <- do.call(rbind, train.ranks)
> train.ranks.matrix <- cbind(train.paths, train.ranks.matrix, "TRAINING")
> train.ranks.df <- data.frame(train.ranks.matrix, stringsAsFactors = FALSE)
> names(train.ranks.df) <- c("Message", "Date", "From", "Subj", "Rank", "Type")
> train.ranks.df$Rank <- as.numeric(train.ranks.df$Rank)
> head(train.ranks.df)
                                                                    Message
1 ../03-Classification/data/easy_ham/01061.6610124afa2a5844d41951439d1c1068
2 ../03-Classification/data/easy_ham/01062.ef7955b391f9b161f3f2106c8cda5edb
3 ../03-Classification/data/easy_ham/01063.ad3449bd2890a29828ac3978ca8c02ab
4 ../03-Classification/data/easy_ham/01064.9f4fc60b4e27bba3561e322c82d5f7ff
5 ../03-Classification/data/easy_ham/01070.6e34c1053a1840779780a315fb083057
6 ../03-Classification/data/easy_ham/01072.81ed44b31e111f9c1e47e53f4dfbefe3
                       Date                   From
1 Thu, 31 Jan 2002 22:44:14 robinderbains@shaw.ca
2      01 Feb 2002 00:53:41 lance_tt@bellsouth.net
3 Fri, 01 Feb 2002 02:01:44 robinderbains@shaw.ca
4  Fri, 1 Feb 2002 10:29:23      matthias@egwn.net
5  Fri, 1 Feb 2002 12:42:02    bfrench@ematic.com
6  Fri, 1 Feb 2002 13:39:31    bfrench@ematic.com
                                          Subj       Rank     Type
1     Please help a newbie compile mplayer :-)   3.614003 TRAINING
2 Re: Please help a newbie compile mplayer :-) 120.742481 TRAINING
3 Re: Please help a newbie compile mplayer :-)  20.348502 TRAINING
4 Re: Please help a newbie compile mplayer :-) 307.809626 TRAINING
5                   Prob. w/ install/uninstall   3.653047 TRAINING
6               RE: Prob. w/ install/uninstall  21.685750 TRAINING
We set the threshold to the median and plot the density of the importance values in the training data.
# Set the threshold to the median
> priority.threshold <- median(train.ranks.df$Rank)
# Plot the density of the importance values in the training data
> threshold.plot <- ggplot(train.ranks.df, aes(x = Rank)) +
    stat_density(aes(fill="darkred")) +
    geom_vline(xintercept = priority.threshold, linetype = 2) +
    scale_fill_manual(values = c("darkred" = "darkred"), guide = "none") +
    theme_bw()
> ggsave(plot = threshold.plot,
         filename = file.path("images", "01_threshold_plot.png"),
         height = 4.7,
         width = 7)
The dotted line in the figure is the median. With this as the threshold, a reasonable share of the emails in both the high-rank tail and the high-density region are included, so classifying those as priority email looks fine.
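The thresholding step itself is small enough to try on a handful of numbers. This sketch uses the six Rank values from the training output above, rounded to one decimal; it only demonstrates the median cut and is not meant to reproduce the real threshold, which is computed over the whole training set.

```r
# Rank values of the first six training emails, rounded
ranks <- c(3.6, 120.7, 20.3, 307.8, 3.7, 21.7)

# Median of the ranks is the cut-off
priority.threshold <- median(ranks)

# 1 = priority email, 0 = not
priority <- ifelse(ranks >= priority.threshold, 1, 0)

print(priority.threshold)
print(priority)
```

With an even number of values the median falls between the two middle ranks, so exactly half of these six emails are flagged as priority.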
Now we add the remaining data and plot it as well.
# Flag the training emails, then compute the importance of the emails in test.paths
> train.ranks.df$Priority <- ifelse(train.ranks.df$Rank >= priority.threshold, 1, 0)
> test.ranks <- suppressWarnings(lapply(test.paths,rank.message))
> test.ranks.matrix <- do.call(rbind, test.ranks)
> test.ranks.matrix <- cbind(test.paths, test.ranks.matrix, "TESTING")
> test.ranks.df <- data.frame(test.ranks.matrix, stringsAsFactors = FALSE)
> names(test.ranks.df) <- c("Message","Date","From","Subj","Rank","Type")
> test.ranks.df$Rank <- as.numeric(test.ranks.df$Rank)
> test.ranks.df$Priority <- ifelse(test.ranks.df$Rank >= priority.threshold, 1, 0)
# Merge the training data and the test data
> final.df <- rbind(train.ranks.df, test.ranks.df)
> final.df$Date <- date.converter(final.df$Date, pattern1, pattern2)
> final.df <- final.df[rev(with(final.df, order(Date))), ]
> head(final.df)
                                                                       Message
2500 ../03-Classification/data/easy_ham/00883.c44a035e7589e83076b7f1fed8fa97d5
2499 ../03-Classification/data/easy_ham/02500.05b3496ce7bca306bed0805425ec8621
2498 ../03-Classification/data/easy_ham/02499.b4af165650f138b10f9941f6cc5bce3c
2497 ../03-Classification/data/easy_ham/02498.09835f512f156da210efb99fcc523e21
2496 ../03-Classification/data/easy_ham/02497.60497db0a06c2132ec2374b2898084d3
2495 ../03-Classification/data/easy_ham/02496.aae0c81581895acfe65323f344340856
     Date                    From
2500 <NA>             sdw@lig.net
2499 <NA> ilug_gmc@fiachra.ucd.ie
2498 <NA>          mwh@python.net
2497 <NA>            nickm@go2.ie
2496 <NA>       phil@techworks.ie
2495 <NA>           timc@2ubh.com
                                                      Subj      Rank    Type
2500                                       Re: ActiveBuddy  6.219744 TESTING
2499                              Re: [ILUG] Linux Install  2.278890 TESTING
2498 [Spambayes] Re: New Application of SpamBayesian tech?  4.265954 TESTING
2497                              Re: [ILUG] Linux Install  4.576643 TESTING
2496                              Re: [ILUG] Linux Install  3.652100 TESTING
2495                          [zzzzteana] Surfing the tube 27.987331 TESTING
     Priority
2500        0
2499        0
2498        0
2497        0
2496        0
2495        1
# Plot
> testing.plot <- ggplot(subset(final.df, Type == "TRAINING"), aes(x = Rank)) +
    stat_density(aes(fill = Type, alpha = 0.65)) +
    stat_density(data = subset(final.df, Type == "TESTING"),
                 aes(fill = Type, alpha = 0.65)) +
    geom_vline(xintercept = priority.threshold, linetype = 2) +
    scale_alpha(guide = "none") +
    scale_fill_manual(values = c("TRAINING" = "darkred", "TESTING" = "darkblue")) +
    theme_bw()
> ggsave(plot = testing.plot,
         filename = file.path("images", "02_testing_plot.png"),
         height = 4.7,
         width = 7)
The test data ends up containing more low-priority emails than the training data. This is because the features of the test data include many values that do not appear in the training data, and those are ignored during ranking, so the result seems reasonable. Hmm.
Finally, we write the priority list out to a CSV file.
> write.csv(final.df, file.path("data", "final_df.csv"), row.names = FALSE)