Day 4 of working through "入門 機械学習" (Machine Learning for Hackers): Chapter 4, "Ranking: Priority Inbox".

We build a system that ranks emails by importance.

Ranking approach

We assign a priority to each email using the following features.

1) Number of messages from the sender
- Emails from senders we exchange many messages with are considered important
2) Thread activity
- Emails in actively discussed threads get a higher priority
- Activity is measured as the number of emails in the thread per second; the higher the value, the more active the thread
3) Words in the subject and body
- An email whose subject contains words that appear frequently in the subjects of active threads is considered important
- An email whose body contains frequently occurring words is considered important

Loading the required modules and data

> setwd("04-Ranking/")  
> library('tm')
> library('ggplot2')
> library('plyr')
> library('reshape')

# For the email data, reuse the easy non-spam (easy_ham) emails from Chapter 3
> data.path <- file.path("..", "03-Classification", "data")
> easyham.path <- file.path(data.path, "easy_ham")

Extracting features from emails

We create functions that read a mail file and return its contents.

# Read a mail file and return all of its lines
> msg.full <- function(path) {
  con <- file(path, open = "rt", encoding = "latin1")
  msg <- readLines(con)
  close(con)
  return(msg)
}

# Extract the From address from the message lines
> get.from <- function(msg.vec) {
  from <- msg.vec[grepl("From: ", msg.vec)]
  from <- strsplit(from, '[":<> ]')[[1]]
  from <- from[which(from  != "" & from != " ")]
  return(from[grepl("@", from)][1])
}
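
To see what the character-class split in get.from does, here is a minimal sketch on a made-up header line. The name and address are hypothetical and not taken from the data set.

```r
# Hypothetical header line, for illustration only
line <- 'From: "Jane Doe" <jane@example.com>'

# Split on quotes, colons, angle brackets and spaces; [[1]] unwraps the list
tokens <- strsplit(line, '[":<> ]')[[1]]

# Drop the empty strings left between adjacent delimiters
tokens <- tokens[tokens != ""]

# The first token containing "@" is taken as the address
address <- tokens[grepl("@", tokens)][1]
print(address)
```

Taking the first token with an "@" is a heuristic, so an unusual From line could still yield something other than the sender address.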

# Extract the subject from the message lines
> get.subject <- function(msg.vec) {
  subj <- msg.vec[grepl("Subject: ", msg.vec)]
  if(length(subj) > 0) {
    return(strsplit(subj, "Subject: ")[[1]][2])
  } else {
    return("")
  }
}

# Extract the body from the message lines
> get.msg <- function(msg.vec) {
  msg <- msg.vec[seq(which(msg.vec == "")[1] + 1, length(msg.vec), 1)]
  return(paste(msg, collapse = "\n"))
}
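
get.msg relies on the convention that the first blank line separates the headers from the body. A small sketch on a hypothetical message vector:

```r
# Hypothetical message: header lines, a blank line, then the body
msg.vec <- c("From: jane@example.com", "Subject: hello", "",
             "First line", "", "Last line")

# The first empty string marks the end of the headers
first.blank <- which(msg.vec == "")[1]

# Everything after it is the body; later blank lines are kept
body <- paste(msg.vec[seq(first.blank + 1, length(msg.vec), 1)], collapse = "\n")
cat(body, "\n")
```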

# Extract the received date from the message lines
> get.date <- function(msg.vec) {
  date.grep <- grepl("^Date: ", msg.vec)
  date.grep <- which(date.grep == TRUE)
  date <- msg.vec[date.grep[1]]
  date <- strsplit(date, "\\+|\\-|: ")[[1]][2]
  date <- gsub("^\\s+|\\s+$", "", date)
  return(strtrim(date, 25))
}
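
The split pattern in get.date is easier to follow on a concrete line. This sketch uses a hypothetical Date header with a timezone offset:

```r
# Hypothetical Date header with a timezone offset
line <- "Date: Thu, 22 Aug 2002 18:26:25 +0700"

# Splitting on "+", "-" or ": " separates the header name,
# the timestamp and the offset
parts <- strsplit(line, "\\+|\\-|: ")[[1]]

# The second piece is the timestamp; trim whitespace and cap at 25 characters
stamp <- gsub("^\\s+|\\s+$", "", parts[2])
stamp <- strtrim(stamp, 25)
print(stamp)
```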

# Read an email and return the features we need
> parse.email <- function(path) {
  full.msg <- msg.full(path)
  date <- get.date(full.msg)
  from <- get.from(full.msg)
  subj <- get.subject(full.msg)
  msg <- get.msg(full.msg)
  return(c(date, from, subj, msg, path))
}

Quick test.

> parse.email("../03-Classification/data/easy_ham/00111.a478af0547f2fd548f7b412df2e71a92")
[1] "Mon, 7 Oct 2002 10:37:26"
[2] "[email protected]"  
...

Reading the emails and collecting the features into a data frame

# Parse all emails
> easyham.docs <- dir(easyham.path)
> easyham.docs <- easyham.docs[which(easyham.docs != "cmds")]
> easyham.parse <- lapply(easyham.docs, function(p) parse.email(file.path(easyham.path, p)))
# Convert to a data frame
> ehparse.matrix <- do.call(rbind, easyham.parse)
> allparse.df <- data.frame(ehparse.matrix, stringsAsFactors = FALSE)
> names(allparse.df) <- c("Date", "From.EMail", "Subject", "Message", "Path")
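
The do.call(rbind, ...) step turns a list of equal-length character vectors into a matrix with one row per email. A minimal sketch with two hypothetical parse results:

```r
# Two hypothetical parse results, each a character vector of five fields
parsed <- list(
  c("Thu, 22 Aug 2002 18:26:25", "a@example.com", "hello", "body one", "path/1"),
  c("Thu, 22 Aug 2002 12:46:18", "b@example.com", "re: hello", "body two", "path/2")
)

# rbind stacks the vectors as rows of a character matrix
m <- do.call(rbind, parsed)

# Keep the columns as plain strings rather than factors
df <- data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("Date", "From.EMail", "Subject", "Message", "Path")
print(dim(df))
```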

Done.

> head(allparse.df)
                       Date                   From.EMail
1 Thu, 22 Aug 2002 18:26:25            kre@munnari.OZ.AU
2 Thu, 22 Aug 2002 12:46:18 steve.burt@cursor-system.com
3 Thu, 22 Aug 2002 13:52:38                timc@2ubh.com
4 Thu, 22 Aug 2002 09:15:25             monty@roscom.com

Cleaning up the data

The send date is still a string, so we convert it to a POSIX object.

# In a Japanese locale, %b does not match month names such as "Aug", so switch the locale first.
> Sys.setlocale(locale="C")
> date.converter <- function(dates, pattern1, pattern2) {
  pattern1.convert <- strptime(dates, pattern1)
  pattern2.convert <- strptime(dates, pattern2)
  pattern1.convert[is.na(pattern1.convert)] <- pattern2.convert[is.na(pattern1.convert)]
  return(pattern1.convert)
}
> pattern1 <- "%a, %d %b %Y %H:%M:%S"
> pattern2 <- "%d %b %Y %H:%M:%S"
> allparse.df$Date <- date.converter(allparse.df$Date, pattern1, pattern2)
> head(allparse.df)                                                                                                                                                
                 Date                   From.EMail
1 2002-08-22 18:26:25            kre@munnari.OZ.AU
2 2002-08-22 12:46:18 steve.burt@cursor-system.com
3 2002-08-22 13:52:38                timc@2ubh.com
4 2002-08-22 09:15:25             monty@roscom.com
5 2002-08-22 14:38:22    Stewart.Smith@ee.ed.ac.uk

# Restore the locale.
> Sys.setlocale(locale="ja_JP.UTF-8")
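
The two patterns exist because some Date headers lack the leading weekday. The following sketch replays the fallback logic of date.converter on two sample strings; it switches only LC_TIME to "C", which is an assumption made here to keep the example independent of the machine's locale.

```r
# Parse English month abbreviations regardless of the machine's locale
old.locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

dates <- c("Thu, 22 Aug 2002 18:26:25",  # matches the first pattern
           "01 Feb 2002 00:53:41")       # no weekday, matches only the second
p1 <- strptime(dates, "%a, %d %b %Y %H:%M:%S")
p2 <- strptime(dates, "%d %b %Y %H:%M:%S")

# Fill the entries the first pattern could not parse from the second
p1[is.na(p1)] <- p2[is.na(p1)]
print(format(p1, "%Y-%m-%d %H:%M:%S"))

# Put the original time locale back
Sys.setlocale("LC_TIME", old.locale)
```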

We also convert the subject and sender address to lowercase.

> allparse.df$Subject <- tolower(allparse.df$Subject)
> allparse.df$From.EMail <- tolower(allparse.df$From.EMail)

Finally, sort by send date.

> priority.df <- allparse.df[with(allparse.df, order(Date)), ]

The first half of the data is used as training data, so we store it in a separate variable.

> priority.train <- priority.df[1:(round(nrow(priority.df) / 2)), ]

Weighting by the number of emails per sender

To weight by the number of emails per sender, we first check what the counts look like.

Count the emails per sender.

> from.weight <- melt(with(priority.train, table(From.EMail)))                                                                                        
> from.weight <- from.weight[with(from.weight, order(value)), ]
> head(from.weight)
                    From.EMail value
1            adam@homeport.org     1
2     admin@networksonline.com     1
4 albert.white@ireland.sun.com     1
5                andr@sandy.ru     1
6             andris@aernet.ru     1
9              antoin@eire.com     1

> summary(from.weight$value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.00    2.00    4.63    4.00   55.00

The mean is 4.63 emails, but the maximum is 55, so the spread seems fairly large. Let's plot the addresses that sent 7 or more emails.

> from.ex <- subset(from.weight, value >= 7)
> from.scales <- ggplot(from.ex) +
  geom_rect(aes(xmin = 1:nrow(from.ex) - 0.5,
                xmax = 1:nrow(from.ex) + 0.5,
                ymin = 0,
                ymax = value,
                fill = "lightgrey",
                color = "darkblue")) +
  scale_x_continuous(breaks = 1:nrow(from.ex), labels = from.ex$From.EMail) +
  coord_flip() +
  scale_fill_manual(values = c("lightgrey" = "lightgrey"), guide = "none") +
  scale_color_manual(values = c("darkblue" = "darkblue"), guide = "none") +
  ylab("Number of Emails Received (truncated at 6)") +
  xlab("Sender Address") +
  theme_bw() +
  theme(axis.text.y = element_text(size = 5, hjust = 1))
> ggsave(plot = from.scales,
       filename = file.path("images", "0011_from_scales.png"),
       height = 4.8,
       width = 7)

[Figure: number of emails per sender, for senders with 7 or more emails]

A few senders have sent more than 10 times as many emails as the average sender. If we used the raw counts as weights, these unusual senders would get far too high a priority. The plot suggests the counts grow roughly exponentially, so we adjust the weights with the natural logarithm.

# Take the log. Add 1 to the count so that the weight never becomes zero.
> from.weight <- transform(from.weight, 
  Weight = log(value + 1), log10Weight = log10(value + 1))
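
To get a feel for how much the log compresses the counts, here is a small sketch with hypothetical counts covering the range from the summary above (1 to 55):

```r
# Hypothetical per-sender counts spanning the observed range
counts <- c(1, 2, 4, 55)

# log(x + 1) keeps single-message senders above zero
# and compresses the large counts
weights <- log(counts + 1)
print(round(weights, 3))

# Ratio between the busiest and the quietest sender,
# before and after the transform
print(max(counts) / min(counts))
print(max(weights) / min(weights))
```

The order of the senders is preserved, but the gap between the busiest and quietest sender shrinks from a factor of 55 to a factor of less than 6.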

Weighting by thread activity

First, we count the emails in each thread.

Whether an email belongs to a thread is determined from its subject.

# Extract the subject without "re: " (= the thread name) and the sender.
> find.threads <- function(email.df) {
  response.threads <- strsplit(email.df$Subject, "re: ")
  is.thread <- sapply(response.threads, function(subj) ifelse(subj[1] == "", TRUE, FALSE))
  threads <- response.threads[is.thread]
  senders <- email.df$From.EMail[is.thread]
  threads <- sapply(threads, function(t) paste(t[2:length(t)], collapse = "re: "))
  return(cbind(senders,threads))
}
> threads.matrix <- find.threads(priority.train)
> head(threads.matrix)
     senders                     threads                                      
[1,] "[email protected]"         "new sequences window"                       
[2,] "[email protected]" "[zzzzteana] nothing like mama used to make" 
[3,] "[email protected]"  "[zzzzteana] nothing like mama used to make" 
[4,] "[email protected]" "[zzzzteana] nothing like mama used to make" 
[5,] "[email protected]"           "[sadev] live rule updates after release ???"
[6,] "[email protected]"     "new sequences window"
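
The thread detection hinges on how strsplit behaves when the subject starts with "re: ". A minimal sketch with hypothetical lowercased subjects:

```r
# Hypothetical lowercased subjects
subjects <- c("re: new sequences window", "weekly report", "re: lunch plans")

# Splitting on "re: " leaves an empty first element
# when the subject starts with it
pieces <- strsplit(subjects, "re: ")
is.thread <- sapply(pieces, function(p) p[1] == "")

# Re-join everything after the first "re: " to get the thread name
threads <- sapply(pieces[is.thread],
                  function(p) paste(p[2:length(p)], collapse = "re: "))
print(is.thread)
print(threads)
```

Only replies are detected this way; the message that started a thread has no "re: " prefix and is matched later by its subject in thread.counts.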

Next, we compute the activity of each thread.

# Return the activity for every thread
> get.threads <- function(threads.matrix, email.df) {
  threads <- unique(threads.matrix[, 2])
  thread.counts <- lapply(threads, function(t) thread.counts(t, email.df))
  thread.matrix <- do.call(rbind, thread.counts)
  return(cbind(threads, thread.matrix))
}

# Return the activity of the emails that belong to the given thread
> thread.counts <- function(thread, email.df) {
  # Collect the send times of the emails that belong to the thread
  thread.times <- email.df$Date[which(email.df$Subject == thread |
                                      email.df$Subject == paste("re:", thread))]
  freq <- length(thread.times)  # total number of emails in the thread
  min.time <- min(thread.times) # earliest send time
  max.time <- max(thread.times) # latest send time
  time.span <- as.numeric(difftime(max.time, min.time, units = "secs"))
  if(freq < 2) {
    # Only one email (no reply, so not really a thread): return NA
    return(c(NA, NA, NA))
  } else {
    trans.weight <- freq / time.span # emails sent per second
    # Take the log; add 10 so the result is not negative (an affine transformation)
    log.trans.weight <- 10 + log(trans.weight, base = 10)
    return(c(freq, time.span, log.trans.weight))
  }
}
> thread.weights <- get.threads(threads.matrix, priority.train)
> thread.weights <- data.frame(thread.weights, stringsAsFactors = FALSE)
> names(thread.weights) <- c("Thread", "Freq", "Response", "Weight")
> thread.weights$Freq <- as.numeric(thread.weights$Freq)
> thread.weights$Response <- as.numeric(thread.weights$Response)
> thread.weights$Weight <- as.numeric(thread.weights$Weight)
> thread.weights <- subset(thread.weights, is.na(thread.weights$Freq) == FALSE)
> head(thread.weights)
                                      Thread Freq Response   Weight
1   please help a newbie compile mplayer :-)    4    42309 5.975627
2                 prob. w/ install/uninstall    4    23745 6.226488
3                       http://apt.nixia.no/   10   265303 5.576258
4         problems with 'apt-get -f install'    3    55960 5.729244
5                   problems with apt update    2     6347 6.498461
6 about apt, kernel updates and dist-upgrade    5   240238 5.318328
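
As a check on the formula, the first row of the table above (4 emails over 42309 seconds) can be recomputed by hand:

```r
# Values from the first row of the table: 4 emails over 42309 seconds
freq <- 4
time.span <- 42309

# Emails per second is a tiny number, so its log10 is negative
trans.weight <- freq / time.span
print(log10(trans.weight))

# Adding 10 shifts the weight into positive territory
weight <- 10 + log(trans.weight, base = 10)
print(round(weight, 6))
```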

In addition, to complement the per-sender weighting, we also compute a weight that reflects how many threads each sender has taken part in.

> email.thread <- function(threads.matrix) {
  senders <- threads.matrix[, 1]
  senders.freq <- table(senders)
  senders.matrix <- cbind(names(senders.freq),
                          senders.freq,
                          log(senders.freq + 1))
  senders.df <- data.frame(senders.matrix, stringsAsFactors=FALSE)
  row.names(senders.df) <- 1:nrow(senders.df)
  names(senders.df) <- c("From.EMail", "Freq", "Weight")
  senders.df$Freq <- as.numeric(senders.df$Freq)
  senders.df$Weight <- as.numeric(senders.df$Weight)
  return(senders.df)
}
> senders.df <- email.thread(threads.matrix)
> head(senders.df)
                    From.EMail Freq    Weight
1            adam@homeport.org    1 0.6931472
2        aeriksson@fastmail.fm    5 1.7917595
3 albert.white@ireland.sun.com    1 0.6931472
4          alex@netwindows.org    1 0.6931472
5                andr@sandy.ru    1 0.6931472
6             andris@aernet.ru    1 0.6931472

Weighting by words in the subject and body

First, the subject.

We extract the words that appear in thread names and compute a weight for each word.
For each word, we take the weights of all threads whose name contains it and use their mean as the word's weight.

# Return the terms and their frequencies
> term.counts <- function(term.vec, control) {
  vec.corpus <- Corpus(VectorSource(term.vec))
  vec.tdm <- TermDocumentMatrix(vec.corpus, control = control)
  return(rowSums(as.matrix(vec.tdm)))
}

# Extract the terms that appear in thread names
> thread.terms <- term.counts(thread.weights$Thread, control = list(stopwords = TRUE))
> thread.terms <- names(thread.terms) # the frequencies are not needed, so keep only the terms
> head(thread.terms)
[1] "--with"    ":-)"       "..."       ".doc"      "'apt-get"  "\"holiday"

# Compute a weight for each term:
# the mean weight of all threads whose name contains the term
> term.weights <- sapply(thread.terms, 
  function(t) mean(thread.weights$Weight[grepl(t, thread.weights$Thread, fixed = TRUE)]))
> head(term.weights)
  --with      :-)      ...     .doc 'apt-get "holiday 
7.109579 6.103883 6.050786 5.725911 5.729244 7.197911 
# Reshape into a data frame
> term.weights <- data.frame(list(Term = names(term.weights),
  Weight = term.weights), stringsAsFactors = FALSE, row.names = 1:length(term.weights))
> head(term.weights)
      Term   Weight
1   --with 7.109579
2      :-) 6.103883
3      ... 6.050786
4     .doc 5.725911
5 'apt-get 5.729244
6 "holiday 7.197911
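
The averaging step can be tried without the tm package. This sketch uses a hypothetical three-row thread table; term.weight is a helper name introduced only for this example.

```r
# Hypothetical thread names and activity weights
threads <- data.frame(Thread = c("apt update fails",
                                 "apt install question",
                                 "lunch plans"),
                      Weight = c(6.5, 5.5, 4.0),
                      stringsAsFactors = FALSE)

# Weight of a term = mean weight of every thread whose name contains it
term.weight <- function(term) {
  mean(threads$Weight[grepl(term, threads$Thread, fixed = TRUE)])
}

print(term.weight("apt"))    # mean of 6.5 and 5.5
print(term.weight("lunch"))  # only one matching thread
```

Because grepl is called with fixed = TRUE, the term is matched as a plain substring, so "apt" would also match a thread name containing "adapter".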

Next, the body.

# Count the terms in the message bodies and their frequencies
> msg.terms <- term.counts(priority.train$Message,
    control = list(stopwords = TRUE, removePunctuation = TRUE, removeNumbers = TRUE))
# Compute the weights, again taking the log
> msg.weights <- data.frame(list(Term = names(msg.terms), Weight = log(msg.terms, base = 10)), 
    stringsAsFactors = FALSE, row.names = 1:length(msg.terms))
# Drop terms whose weight is zero
> msg.weights <- subset(msg.weights, Weight > 0)
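
The body weights are the log10 of the raw counts, so a term seen exactly once gets weight 0 and is removed. A sketch with hypothetical counts:

```r
# Hypothetical term counts over all training message bodies
msg.terms <- c(linux = 100, kernel = 10, typo = 1)

# log10 of the count: a term seen once gets weight 0
weights <- data.frame(Term = names(msg.terms),
                      Weight = log(msg.terms, base = 10),
                      stringsAsFactors = FALSE,
                      row.names = seq_along(msg.terms))

# Terms with zero weight are dropped
weights <- subset(weights, Weight > 0)
print(weights)
```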

Now all of the weight data frames are ready.

Ranking the emails

# Return the weight for the given terms
# Takes the search terms, the weight data frame to search, and a flag saying whether that data frame is term.weights, and returns the weight.
> get.weights <- function(search.term, weight.df, term = TRUE) {
  if(length(search.term) > 0) {
    # The column name differs depending on whether weight.df is term.weights, so pick it here
    if(term) {
      term.match <- match(names(search.term), weight.df$Term)
    } else {
      term.match <- match(search.term, weight.df$Thread)
    }
    match.weights <- weight.df$Weight[which(!is.na(term.match))]
    if(length(match.weights) < 1) {
      # No match: use 1
      return(1)
    } else {
      # One or more matches: use the mean
      return(mean(match.weights))
    }
  } else {
    return(1)
  }
}
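
It helps to look at what match() returns on a toy table. Note that get.weights above indexes weight.df$Weight with which(!is.na(term.match)), which are positions within the search terms; whether that is intended is not clear from the text, so the sketch below only shows how that differs from indexing by the matched rows. All data here is hypothetical.

```r
# Hypothetical weight table and the terms found in one message
weight.df <- data.frame(Term = c("apt", "kernel", "linux"),
                        Weight = c(6.0, 5.0, 3.0),
                        stringsAsFactors = FALSE)
search.terms <- c("linux", "unknown")

# match() returns, for each search term, its row in weight.df (NA if absent)
term.match <- match(search.terms, weight.df$Term)
print(term.match)

# Indexing by the matched rows picks the weight of "linux"
matched.rows <- term.match[!is.na(term.match)]
print(weight.df$Weight[matched.rows])

# which(!is.na(term.match)) gives positions within search.terms instead,
# so here it would select row 1 of weight.df ("apt")
print(which(!is.na(term.match)))
```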

# Return the importance of an email
> rank.message <- function(path) {
  
  # Parse the email
  msg <- parse.email(path)
  
  # Weight based on the number of emails the sender has sent
  from <- ifelse(length(which(from.weight$From.EMail == msg[2])) > 0,
                 from.weight$Weight[which(from.weight$From.EMail == msg[2])],
                 1)
  
  # Weight based on the number of threads the sender has taken part in
  thread.from <- ifelse(length(which(senders.df$From.EMail == msg[2])) > 0,
                        senders.df$Weight[which(senders.df$From.EMail == msg[2])],
                        1)
  
  # Check whether the email is a post to a thread; if so, get the thread's weight
  subj <- strsplit(tolower(msg[3]), "re: ")
  is.thread <- ifelse(subj[[1]][1] == "", TRUE, FALSE)
  if(is.thread){
    activity <- get.weights(subj[[1]][2], thread.weights, term = FALSE)
  } else {
    # Not a post to a thread: the weight is 1
    activity <- 1
  }
  
  # Weight based on the email subject
  thread.terms <- term.counts(msg[3], control = list(stopwords = TRUE))
  thread.terms.weights <- get.weights(thread.terms, term.weights)
  
  # Weight based on the email body
  msg.terms <- term.counts(msg[4],
                           control = list(stopwords = TRUE,
                           removePunctuation = TRUE,
                           removeNumbers = TRUE))
  msg.weights <- get.weights(msg.terms, msg.weights)
  
  # Multiply all the weights together to get the importance
  rank <- prod(from,
               thread.from,
               activity, 
               thread.terms.weights,
               msg.weights)
  
  return(c(msg[1], msg[2], msg[3], rank))
}

Quick test.

> rank.message("../03-Classification/data/easy_ham/00111.a478af0547f2fd548f7b412df2e71a92")
[1] "Mon, 7 Oct 2002 10:37:26"                                
[2] "[email protected]"                                          
[3] "Re: [ILUG] Interesting article on free software licences"
[4] "5.27542087468428"
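
The final score is simply the product of the five weights, with 1 standing in for any feature that is missing. A sketch with hypothetical weights for a single message:

```r
# Hypothetical weights for one message
from         <- log(12 + 1)  # sender has 12 emails in the training data
thread.from  <- log(3 + 1)   # sender appears 3 times in threads
activity     <- 1            # not a reply, so the neutral weight 1 is used
subject.term <- 6.1          # mean weight of the subject terms
body.term    <- 1.3          # mean weight of the body terms

# A feature with weight 1 leaves the product unchanged
rank <- prod(from, thread.from, activity, subject.term, body.term)
print(rank)
```

Since the weights are multiplied rather than added, a single large weight scales the whole score, which is one reason the individual weights are log-transformed first.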

Checking whether the priority threshold is reasonable

This time we use the median priority as the threshold. We use half of the data to check whether that threshold is reasonable.

> train.paths <- priority.df$Path[1:(round(nrow(priority.df) / 2))]
> test.paths <- priority.df$Path[((round(nrow(priority.df) / 2)) + 1):nrow(priority.df)]

# Compute the importance of each email in train.paths
> train.ranks <- suppressWarnings(lapply(train.paths, rank.message))
# Convert to a data frame
> train.ranks.matrix <- do.call(rbind, train.ranks)
> train.ranks.matrix <- cbind(train.paths, train.ranks.matrix, "TRAINING")
> train.ranks.df <- data.frame(train.ranks.matrix, stringsAsFactors = FALSE)
> names(train.ranks.df) <- c("Message", "Date", "From", "Subj", "Rank", "Type")
> train.ranks.df$Rank <- as.numeric(train.ranks.df$Rank)
> head(train.ranks.df)
                                                                    Message
1 ../03-Classification/data/easy_ham/01061.6610124afa2a5844d41951439d1c1068
2 ../03-Classification/data/easy_ham/01062.ef7955b391f9b161f3f2106c8cda5edb
3 ../03-Classification/data/easy_ham/01063.ad3449bd2890a29828ac3978ca8c02ab
4 ../03-Classification/data/easy_ham/01064.9f4fc60b4e27bba3561e322c82d5f7ff
5 ../03-Classification/data/easy_ham/01070.6e34c1053a1840779780a315fb083057
6 ../03-Classification/data/easy_ham/01072.81ed44b31e111f9c1e47e53f4dfbefe3
                       Date                   From
1 Thu, 31 Jan 2002 22:44:14  robinderbains@shaw.ca
2      01 Feb 2002 00:53:41 lance_tt@bellsouth.net
3 Fri, 01 Feb 2002 02:01:44  robinderbains@shaw.ca
4  Fri, 1 Feb 2002 10:29:23      matthias@egwn.net
5  Fri, 1 Feb 2002 12:42:02     bfrench@ematic.com
6  Fri, 1 Feb 2002 13:39:31     bfrench@ematic.com
                                          Subj       Rank     Type
1     Please help a newbie compile mplayer :-)   3.614003 TRAINING
2 Re: Please help a newbie compile mplayer :-) 120.742481 TRAINING
3 Re: Please help a newbie compile mplayer :-)  20.348502 TRAINING
4 Re: Please help a newbie compile mplayer :-) 307.809626 TRAINING
5                   Prob. w/ install/uninstall   3.653047 TRAINING
6               RE: Prob. w/ install/uninstall  21.685750 TRAINING

We set the threshold to the median and plot the density of the training-data ranks.

# Set the threshold to the median
> priority.threshold <- median(train.ranks.df$Rank)
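
A median threshold splits the training emails roughly in half. A minimal sketch with hypothetical rank values, using the same >= comparison that is applied to the real data:

```r
# Hypothetical rank values
ranks <- c(3.6, 120.7, 20.3, 307.8, 3.7, 21.7)

# With an even number of values the median is the mean of the two middle ones
threshold <- median(ranks)
print(threshold)

# Flag everything at or above the threshold as priority
priority <- ifelse(ranks >= threshold, 1, 0)
print(priority)
```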

# Plot the density of the training-data ranks
> threshold.plot <- ggplot(train.ranks.df, aes(x = Rank)) +
  stat_density(aes(fill="darkred")) +
  geom_vline(xintercept = priority.threshold, linetype = 2) +
  scale_fill_manual(values = c("darkred" = "darkred"), guide = "none") +
  theme_bw()
> ggsave(plot = threshold.plot,
       filename = file.path("images", "01_threshold_plot.png"),
       height = 4.7,
       width = 7)

[Figure: density of the training-data ranks, with the median threshold shown as a dashed line]

The dashed line in the figure is the median. With the threshold here, both the high-rank tail and a fair share of the emails in the high-density region are included, so treating these as priority emails seems reasonable.

Let's add the remaining data and plot it as well.

# Compute the importance of each email in test.paths
> train.ranks.df$Priority <- ifelse(train.ranks.df$Rank >= priority.threshold, 1, 0)
> test.ranks <- suppressWarnings(lapply(test.paths,rank.message))
> test.ranks.matrix <- do.call(rbind, test.ranks)
> test.ranks.matrix <- cbind(test.paths, test.ranks.matrix, "TESTING")
> test.ranks.df <- data.frame(test.ranks.matrix, stringsAsFactors = FALSE)
> names(test.ranks.df) <- c("Message","Date","From","Subj","Rank","Type")
> test.ranks.df$Rank <- as.numeric(test.ranks.df$Rank)
> test.ranks.df$Priority <- ifelse(test.ranks.df$Rank >= priority.threshold, 1, 0)

# Merge the training and test data
> final.df <- rbind(train.ranks.df, test.ranks.df)
> final.df$Date <- date.converter(final.df$Date, pattern1, pattern2)
> final.df <- final.df[rev(with(final.df, order(Date))), ]
> head(final.df)
                                                                       Message
2500 ../03-Classification/data/easy_ham/00883.c44a035e7589e83076b7f1fed8fa97d5
2499 ../03-Classification/data/easy_ham/02500.05b3496ce7bca306bed0805425ec8621
2498 ../03-Classification/data/easy_ham/02499.b4af165650f138b10f9941f6cc5bce3c
2497 ../03-Classification/data/easy_ham/02498.09835f512f156da210efb99fcc523e21
2496 ../03-Classification/data/easy_ham/02497.60497db0a06c2132ec2374b2898084d3
2495 ../03-Classification/data/easy_ham/02496.aae0c81581895acfe65323f344340856
     Date                    From
2500 <NA>             sdw@lig.net
2499 <NA> ilug_gmc@fiachra.ucd.ie
2498 <NA>          mwh@python.net
2497 <NA>            nickm@go2.ie
2496 <NA>       phil@techworks.ie
2495 <NA>           timc@2ubh.com
                                                      Subj      Rank    Type
2500                                       Re: ActiveBuddy  6.219744 TESTING
2499                              Re: [ILUG] Linux Install  2.278890 TESTING
2498 [Spambayes] Re: New Application of SpamBayesian tech?  4.265954 TESTING
2497                              Re: [ILUG] Linux Install  4.576643 TESTING
2496                              Re: [ILUG] Linux Install  3.652100 TESTING
2495                          [zzzzteana] Surfing the tube 27.987331 TESTING
     Priority
2500        0
2499        0
2498        0
2497        0
2496        0
2495        1
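
Every Date in the output above is <NA>. One possible cause (an assumption, not confirmed by the text) is that the locale had already been switched back to ja_JP.UTF-8 before this second call to date.converter, so %b no longer matched the English month names. Below is a minimal sketch of a wrapper that switches LC_TIME only while parsing; with.c.time is a hypothetical helper name.

```r
# Hypothetical helper: call a parsing function under the C time locale,
# then restore whatever locale was active before
with.c.time <- function(f, ...) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old))
  Sys.setlocale("LC_TIME", "C")
  f(...)
}

parsed <- with.c.time(strptime, "Mon, 7 Oct 2002 10:37:26",
                      "%a, %d %b %Y %H:%M:%S")
print(format(parsed, "%Y-%m-%d %H:%M:%S"))
```

The same wrapper could be applied to date.converter, for example with.c.time(date.converter, final.df$Date, pattern1, pattern2).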

# Plot
> testing.plot <- ggplot(subset(final.df, Type == "TRAINING"), aes(x = Rank)) +
  stat_density(aes(fill = Type, alpha = 0.65)) +
  stat_density(data = subset(final.df, Type == "TESTING"),
               aes(fill = Type, alpha = 0.65)) +
  geom_vline(xintercept = priority.threshold, linetype = 2) +
  scale_alpha(guide = "none") +
  scale_fill_manual(values = c("TRAINING" = "darkred", "TESTING" = "darkblue")) +
  theme_bw()
> ggsave(plot = testing.plot,
       filename = file.path("images", "02_testing_plot.png"),
       height = 4.7,
       width = 7)

[Figure: density of ranks for the training and test data, with the threshold shown as a dashed line]

The test data contains more low-priority emails than the training data. This is because many of the features in the test data do not appear in the training data and are ignored during ranking, so the result seems plausible.

Finally, we write the priority list to a CSV file, and that's it.

> write.csv(final.df, file.path("data", "final_df.csv"), row.names = FALSE)
