Building a Recommender System for Hatena Bookmark Articles: Using the Hatena API from Python and Model-Based Recommendation in R
I rely heavily on Hatena Bookmark for gathering information, and when I have spare time I spend a good share of it looking for articles there. Hatena Bookmark is handy for finding the latest articles, but it is not much use for digging up older ones. Personally, I would like to get recommendations in the areas I care about even when the articles are somewhat old.
Fortunately, Hatena publishes APIs, so Hatena Bookmark data is relatively easy to obtain. So I decided to use these APIs to build, in R and Python, a recommender that finds articles that suit me.
The data is collected through the Hatena API. Specifically, I use the Hatena Bookmark feed to get the URLs I have bookmarked, use the entry information API to extract the users who bookmarked those URLs, and then collect the URLs those users have bookmarked. Repeating this user → bookmarked URL → user → bookmarked URL step would let me collect URLs in bulk, but the goal here is to find articles that are likely to suit me, so I limit the search to users who have bookmarked the same articles as I have.
I want to keep the effort low, so I collect the articles with Python and do the data wrangling and article scoring in R. From a system-building standpoint it might be better to do everything in Python, but I am personally more used to R for data wrangling and model building.
Whatever we do, we need data first, so let's collect URLs with Python. This is the code I used for data collection.
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script (urllib2, print statement)

import sys, re, time
import csv
import urllib2
import json
import feedparser

# Python 2 workaround so that unicode titles can be written by the csv module
reload(sys)
sys.setdefaultencoding("utf-8")

def main():
    # Prepare the HTTP opener
    opener = urllib2.build_opener()
    user = "Overlap"  # your Hatena user id

    # Fetch my own bookmarks from the Hatena Bookmark feed
    url_list = []
    seen = set()  # (url, user) pairs collected so far
    id = 0
    for i in range(0, 200, 20):
        feed_url = "http://b.hatena.ne.jp/" + user + "/rss?of=" + str(i)  # feed query
        try:
            response = opener.open(feed_url)  # open the URL
        except Exception:
            continue
        content = response.read()         # raw feed
        feed = feedparser.parse(content)  # parse the feed with feedparser
        # Stop when there are no more entries
        if feed["entries"] == []:
            break
        # Build the URL list
        for e in feed["entries"]:
            try:
                # Commas and double quotes are stripped from the title
                url_list.append([id, e["link"], user, e["hatena_bookmarkcount"], re.sub("[,\"]", "", e["title"])])
                seen.add((e["link"], user))
                id += 1
            except Exception:
                pass
        time.sleep(0.05)  # throttle requests

    # Extract the users who bookmarked each of my URLs
    user_list = []
    for url in url_list:
        try:
            response = opener.open("http://b.hatena.ne.jp/entry/jsonlite/" + url[1])  # bookmark info for this URL
        except Exception:
            continue
        content = response.read()
        tmp = json.loads(content)  # parse the JSON
        # Build the user list (guarding against an empty response)
        for b in (tmp or {}).get("bookmarks", []):
            user_list.append([url[0], b["user"]])
        time.sleep(0.05)  # throttle requests

    # Count how many of my URLs each user has bookmarked
    # (the loop variable no longer overwrites the running `id` counter)
    count_user = {}
    for _, uname in user_list:
        count_user[uname] = count_user.get(uname, 0) + 1

    # Fetch the bookmarks of the users who share the most bookmarks with me
    for uname, count in sorted(count_user.items(), key=lambda x: x[1], reverse=True):
        print uname, count
        if uname == user:
            continue  # skip my own id
        if count < 10:
            break  # stop once fewer than 10 bookmarks are shared
        # Fetch the 200 most recent bookmarked URLs
        for i in range(0, 200, 20):
            feed_url = "http://b.hatena.ne.jp/" + uname + "/rss?of=" + str(i)  # feed query
            try:
                response = opener.open(feed_url)  # fetch the feed
            except Exception:
                continue
            content = response.read()
            feed = feedparser.parse(content)  # parse the feed
            if feed["entries"] == []:
                break
            for e in feed["entries"]:
                if (e["link"], uname) in seen:
                    continue  # skip pairs that were already collected
                try:
                    url_list.append([id, e["link"], uname, e["hatena_bookmarkcount"], re.sub("[,\"]", "", e["title"])])
                    seen.add((e["link"], uname))
                    id += 1
                except Exception:
                    pass
            time.sleep(0.05)  # throttle requests
    print len(url_list)

    # Write the result to a file
    ofname = "url_list.csv"
    fout = open(ofname, "w")
    writer = csv.writer(fout, delimiter=",")
    writer.writerow(["id", "url", "user", "count", "title"])
    for t in url_list:
        writer.writerow(t)
    fout.close()

if __name__ == "__main__":
    main()
I don't plan to reuse this much, so everything is written inline in main and the error handling is sloppy. The Hatena feed is parsed with feedparser. I found the following page easy to follow on feedparser:
  http://python.g.hatena.ne.jp/muscovyduck/20081221/p1
The Hatena Bookmark entry information API returns JSON, so I parse it with the json library. The following page is a useful reference for reading JSON:
  http://tmlife.net/programming/python/python-json-module.html
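To make the JSON step concrete, here is a minimal sketch written for Python 3 (unlike the Python 2 script above). The payload is a hypothetical, heavily trimmed example; only the "bookmarks" list and the "user" key are taken from the script, and the handling of an empty response is my own assumption about edge cases.

```python
import json

# Hypothetical, trimmed payload: only "bookmarks" and "user" appear in the script above.
sample = '{"url": "http://example.com/", "bookmarks": [{"user": "alice"}, {"user": "bob"}]}'

def bookmark_users(payload):
    """Return the user ids found in an entry-information JSON string."""
    data = json.loads(payload)
    if not data:  # assumption: the response may be empty (e.g. "null")
        return []
    return [b["user"] for b in data.get("bookmarks", [])]

print(bookmark_users(sample))
```

The guard keeps a URL without bookmark data from stopping the whole crawl.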
Also, hammering the API at full speed is not kind to the server, so I pause for about 0.05 seconds between requests.
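If the pause needs to be reused in several loops, it could be wrapped in a small helper. The sketch below is Python 3 and purely illustrative: the `Throttle` class and its parameters are hypothetical names of mine, not part of the original script, which simply calls time.sleep(0.05).

```python
import time

class Throttle:
    """Make sure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the helper can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

throttle = Throttle(0.05)
for page in range(3):
    throttle.wait()
    # a request to the API would go here
```

Compared with a fixed sleep, this only waits for whatever part of the interval has not already been spent on the request itself.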
The code above fetches up to 200 bookmarked URLs for each user who shares 10 or more bookmarks with me. These parameters may be worth adjusting to balance accuracy against data volume.
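The user selection can be isolated into a small function so that the threshold is easy to change. This is a Python 3 sketch under my own naming: `select_similar_users` and `min_shared` are hypothetical, and the sample pairs are made up.

```python
from collections import Counter

def select_similar_users(user_list, me, min_shared=10):
    """Return (user, shared_count) pairs, most shared first, excluding `me`."""
    counts = Counter(uname for _url_id, uname in user_list)
    return [(u, c) for u, c in counts.most_common() if u != me and c >= min_shared]

# Hypothetical (url_id, user) pairs in the same shape as user_list in the script
pairs = [(0, "alice"), (1, "alice"), (1, "bob"), (2, "alice"), (0, "me")]
print(select_similar_users(pairs, me="me", min_shared=2))
```

Lowering `min_shared` pulls in more users and more URLs at the cost of noisier neighbours; raising it does the opposite.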
Now that we have the URL data, the next step is to build a model from it in R and extract URLs to recommend.
In R, I reshape the data into a url × user matrix and score URLs based on the ones I have bookmarked. For reshaping I use reshape2 to build a pivot table. For scoring, plain collaborative filtering can only recommend a limited set of articles when the data is sparse, so I use model-based filtering instead.
The algorithm is random forest. I chose it because it gives reasonable accuracy without parameter tuning and because it reports variable importance (here, which users resemble me). Incidentally, on Windows you may hit a character encoding error; in that case, converting url_list.csv to Shift-JIS with an editor or similar should fix it.
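If you would rather script the Shift-JIS conversion than use an editor, something like the following Python 3 sketch should work. It assumes the CSV was written as UTF-8 and uses Python's cp932 codec (the Windows flavour of Shift-JIS); `convert_encoding` and the file names in the demo are hypothetical.

```python
import os
import tempfile

def convert_encoding(src, dst, src_enc="utf-8", dst_enc="cp932"):
    """Copy a text file line by line while changing its encoding."""
    with open(src, encoding=src_enc, newline="") as fin, \
         open(dst, "w", encoding=dst_enc, errors="replace", newline="") as fout:
        for line in fin:
            fout.write(line)

# Demo on a throwaway file shaped like url_list.csv
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "url_list.csv")
dst = os.path.join(tmpdir, "url_list_sjis.csv")
with open(src, "w", encoding="utf-8", newline="") as f:
    f.write("id,url,user,count,title\n0,http://example.com/,me,3,機械学習\n")
convert_encoding(src, dst)
```

errors="replace" turns characters that Shift-JIS cannot represent into "?" instead of aborting, which seems acceptable for titles used only for display. Passing fileEncoding="UTF-8" to read.csv in R may avoid the conversion altogether.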
Here is the code I used.
library(reshape2)
library(randomForest)

user <- "Overlap" # my Hatena id

# Load the data
url_list <- read.csv("url_list.csv", header=TRUE, sep=",")
tmp <- unique(url_list[,c(2,5)]) # keep url and title
tmp[,2] <- substr(tmp[,2],1,50)  # cut titles at 50 characters

# Reshape the data with a pivot table (rows: url, columns: user)
dat <- dcast(melt(url_list,id.vars=c("url","user"),measure.vars="id"),url~user,length)
url <- dat[,1]                         # keep the urls
dat <- dat[,-1]                        # drop the url column
target <- dat[,colnames(dat) == user]  # keep my own column as the target
dat <- dat[,colnames(dat) != user]     # drop my own column from the features
users <- colnames(dat)                 # keep the user names
colnames(dat) <- paste(rep("user",ncol(dat)),1:ncol(dat),sep="") # rename the columns
dat <- data.frame(dat,target)          # add the target

# Build the model with randomForest
dat.rf <- randomForest(factor(target)~.,data=dat)
pred.rf <- predict(dat.rf,dat,type="prob")[,2] # apply the model
rank <- data.frame(url,dat$target,pred.rf)
rank <- merge(rank,tmp,by="url",all.x=TRUE)    # add the titles
rank <- rank[order(rank$pred.rf,decreasing=TRUE),]
rank <- rank[rank$dat.target==0,c(4,1,3)]      # keep only URLs I have not bookmarked
rownames(rank) <- 1:nrow(rank)
write.csv(rank,"rank.csv") # write the result

# Check the variable importance
sim_user <- data.frame(users,varImpPlot(dat.rf))
head(sim_user[order(sim_user$MeanDecreaseGini,decreasing=TRUE),],20)
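For readers more at home in Python, the reshaping step of the R script (one row per URL, one 0/1 column per other user, plus a target column for my own bookmarks) can be sketched as below. This is a Python 3 illustration with made-up data; `build_matrix` is a hypothetical helper, and it records presence as 0/1 where the dcast call counts rows.

```python
def build_matrix(rows, me):
    """rows: iterable of (url, user) pairs. Returns (urls, users, X, y)."""
    rows = list(rows)
    seen = set(rows)
    urls = sorted({url for url, _ in rows})
    users = sorted({name for _, name in rows if name != me})
    # X[i][j] is 1 when users[j] bookmarked urls[i]
    X = [[1 if (url, name) in seen else 0 for name in users] for url in urls]
    # y[i] is 1 when I bookmarked urls[i]
    y = [1 if (url, me) in seen else 0 for url in urls]
    return urls, users, X, y

rows = [("u1", "me"), ("u1", "alice"), ("u2", "alice"), ("u2", "bob"), ("u3", "bob")]
urls, users, X, y = build_matrix(rows, me="me")
for url, features, target in zip(urls, X, y):
    print(url, features, target)
```

A classifier trained on X against y then scores the rows where y is 0, which corresponds to the URLs I have not bookmarked yet.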
Output of the R script (top 10; titles are cut at 50 characters, "..." marks a cut-off title):
"","title","url","pred.rf"
"1","No. 19: Learning logistic regression: Machine Learning - Let's Get Started | gihyo.jp ... Gijutsu-Hyoron","http://gihyo.jp/dev/serial/01/machine-learning/0019",0.214
"2","Data science that SEs should know - How to use a CRM system built on behavior prediction and its requirements: ITpro","http://itpro.nikkeibp.co.jp/article/COLUMN/20130507/475061/",0.2
"3","We will become a data analytics and software company? Jeff Immelt of General Electric (GE) US","http://itpro.nikkeibp.co.jp/article/COLUMN/20130604/482084/",0.182
"4","NTT sets up a new unit to make its labs earn money: monetizing big data technology with AI : Nikkei","http://www.nikkei.com/article/DGXZZO55804220T00C13A6000000/",0.174
"5","Introduction to Data Analysis with R and Python","http://www.slideshare.net/gepuro/rpython",0.148
"6","Machine Learning Tutorial @ Jubatus Casual Talks","http://www.slideshare.net/unnonouno/jubatus-casual-talks",0.148
"7","I ended up doing nothing but games and Twitter and Facebook: the Gunosy development team ...","http://gigazine.net/news/20130417-gunosy/",0.144
"8","Introduction to Statistical Analysis with R: Statistical Analysis / Technical Data Presentation / Kiichiro Kajiyama","http://monge.tec.fukuoka-u.ac.jp/R_analysis/0r_analysis.html#cross_table",0.144
"9","Informatics Research Data Repository: Nico Nico Douga comment data","http://www.nii.ac.jp/cscenter/idr/nico/nico.html",0.142
"10","Textbooks you should definitely have on hand if you aim to become a data scientist - Working in Ginza ...","http://tjo.hatenablog.com/entry/2013/05/07/191000",0.138
Sure enough, it's nothing but data mining and big data articles. The results reflect my interests well, and I'm personally quite satisfied. The Hatena API also provides tag information and the like, so using tags or text analysis of the titles should improve accuracy further. The list also contains many articles that I read but chose not to bookmark, but that is hard to handle without browsing history. This is a hobby-level experiment, so I think this is good enough for now.
For reference, the users with the highest importance were as follows.
         users           MeanDecreaseGini
user95   TohgorohMatsui  3.635295
user37   irisu22001      3.068073
user115  yokkuns         2.462939
user48   Keiku           1.886593
user61   moa108          1.737433
user5    asa6008885      1.734965
user76   s-feng          1.731526
user116  yoshia_e        1.723635
user102  TYK             1.573865
user34   ilford400       1.537937
High importance does not necessarily mean high similarity, but being able to spot candidate users whose interests are close to mine is, I think, one of the fun parts of this analysis.
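To check similarity directly rather than through importance, one option (not used in this post) is a set-overlap measure such as the Jaccard index over each user's bookmarked URLs. The Python 3 sketch below uses made-up URL sets.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B| (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical bookmark sets
mine = {"u1", "u2", "u3"}
other = {"u2", "u3", "u4"}
print(jaccard(mine, other))  # 2 shared URLs out of 4 distinct ones → 0.5
```

Comparing such a score with MeanDecreaseGini for the same users would show where importance and similarity disagree.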
Hatena Bookmark is a really interesting source of freely usable data, so I'd like to keep thinking about other fun things to do with it.