A while ago I wrote an article about fetching Hatena Bookmark hot entries by scraping.
A comment on that article suggested that fetching the RSS feed would be a better approach, so I looked into it and rewrote the code.
- Hatena's RSS feeds
- feedparser
- Code that formats the data and writes it to CSV
- Summary
- Reference links
- Other articles I've written
Hatena's RSS feeds
It seems you only need to append .rss to the end of each category page's URL.
For the "All" (総合) hot entries, the feed URL is as follows.
http://b.hatena.ne.jp/hotentry.rss
It may also be worth remembering that some browsers have a feature for detecting RSS feeds.
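The URL rule above can be sketched as a small helper. This is only an illustration: `hotentry_feed_url` and `BASE_URL` are names I made up, and the category slugs (`it`, `knowledge`, `life`, `social`) are the ones that appear in the feed URLs used later in this article.

```python
# Hypothetical helper (not part of any library): build a hot entry feed URL
# by appending ".rss" to the category page URL.
BASE_URL = "http://b.hatena.ne.jp/hotentry"


def hotentry_feed_url(category=None):
    """Return the feed URL for a category slug, or the "All" feed if None."""
    if category is None:
        return BASE_URL + ".rss"
    return "{}/{}.rss".format(BASE_URL, category)


if __name__ == "__main__":
    print(hotentry_feed_url())      # the "All" feed
    print(hotentry_feed_url("it"))  # the IT category feed
```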
feedparser
feedparser looks like a convenient way to parse RSS.
First, install the package.
pip install feedparser
Next, the code.
Import feedparser and pass the RSS URL to its parse function.
I checked the contents of the resulting dictionary by printing it and by reading the documentation.
The three things I ultimately wanted (bookmark count, title, and link) could be retrieved as follows.
import feedparser

RSS_URL = "http://b.hatena.ne.jp/hotentry.rss"

# Fetch and parse the feed
hatebu_dic = feedparser.parse(RSS_URL)

# Print the bookmark count, title, and link of each entry
for x in hatebu_dic.entries:
    hbm_count = x.hatena_bookmarkcount
    title = x.title
    link = x.link
    print(hbm_count, title, link)
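As far as I can tell, the `hatena_bookmarkcount` attribute corresponds to a namespaced `<hatena:bookmarkcount>` element inside each feed item. The sketch below reads the same three fields with only the standard library, to show what sits underneath. The sample XML is hand-written and is not real feed data, and the `hatena` namespace URI is my assumption, so check it against the actual feed before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in RSS 1.0 (RDF) style. NOT real feed data.
# The hatena namespace URI below is an assumption.
SAMPLE = """<rdf:RDF xmlns="http://purl.org/rss/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:hatena="http://www.hatena.ne.jp/info/xml">
  <item rdf:about="http://example.com/a">
    <title>Sample entry A</title>
    <link>http://example.com/a</link>
    <hatena:bookmarkcount>120</hatena:bookmarkcount>
  </item>
  <item rdf:about="http://example.com/b">
    <title>Sample entry B</title>
    <link>http://example.com/b</link>
    <hatena:bookmarkcount>45</hatena:bookmarkcount>
  </item>
</rdf:RDF>"""

NS = {
    "rss": "http://purl.org/rss/1.0/",
    "hatena": "http://www.hatena.ne.jp/info/xml",  # assumed URI
}


def read_entries(xml_text):
    """Return (bookmark count, title, link) tuples; counts stay as strings."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.findall("rss:item", NS):
        count = item.findtext("hatena:bookmarkcount", default="0", namespaces=NS)
        title = item.findtext("rss:title", default="", namespaces=NS)
        link = item.findtext("rss:link", default="", namespaces=NS)
        entries.append((count, title, link))
    return entries


if __name__ == "__main__":
    for count, title, link in read_entries(SAMPLE):
        print(count, title, link)
```

In practice feedparser does all of this for you, which is the reason to use it. The point here is only that the bookmark count arrives as text.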
The result looks like this.
Not bad, I think?
Code that formats the data and writes it to CSV
This time I fetch the hot entries for the categories I'm interested in rather than the Hatena "All" feed.
import csv

import feedparser

# Hot entry RSS feeds to fetch and parse
# "All" feed (defined for reference, not used below)
RSS_URL = "http://b.hatena.ne.jp/hotentry.rss"
it = "http://b.hatena.ne.jp/hotentry/it.rss"
manabi = "http://b.hatena.ne.jp/hotentry/knowledge.rss"
kurashi = "http://b.hatena.ne.jp/hotentry/life.rss"
yononaka = "http://b.hatena.ne.jp/hotentry/social.rss"
rss = [it, manabi, kurashi, yononaka]

hotentry = []

# Store the bookmark count, title, and link of each entry
for n in rss:
    hatebu_dic = feedparser.parse(n)
    for x in hatebu_dic.entries:
        hbm_count = x.hatena_bookmarkcount
        title = x.title
        link = x.link
        hotentry.append((hbm_count, title, link))

# Sort by bookmark count, highest first
hotentry = sorted(hotentry, key=lambda x: int(x[0]), reverse=True)

# Print the entries to check them
for x in hotentry:
    print('{} || {} \n {}'.format(x[0], x[1], x[2]))

# Write to CSV; characters that CP932 cannot encode are silently dropped
with open('hatebu_rss.csv', 'w', encoding='CP932', errors='ignore', newline='') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerows(hotentry)
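Two details in the code above are easy to miss: the sort key converts the count with `int()`, and the file is opened with `errors='ignore'`. Here is a small standard-library sketch of why both matter. The rows are made-up sample data and `to_csv_bytes` is a hypothetical helper; it assumes the counts are strings, which is why the original code needs `int()` in the first place.

```python
import csv
import io

# Made-up sample rows: (bookmark count as a string, title, link)
rows = [
    ("9", "Sample A", "http://example.com/a"),
    ("120", "Sample B \U0001F600", "http://example.com/b"),  # title ends with an emoji
    ("45", "Sample C", "http://example.com/c"),
]

# Sorting the raw strings compares them character by character,
# so "9" ranks above "120".
by_text = sorted(rows, key=lambda r: r[0], reverse=True)

# Converting to int gives the intended numeric order.
by_number = sorted(rows, key=lambda r: int(r[0]), reverse=True)


def to_csv_bytes(rows, encoding="cp932"):
    """Hypothetical helper: render rows as CSV, dropping unencodable characters."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue().encode(encoding, errors="ignore")


if __name__ == "__main__":
    print([r[0] for r in by_text])
    print([r[0] for r in by_number])
    print(to_csv_bytes(by_number).decode("cp932"))
```

Because `errors='ignore'` drops characters silently, titles containing emoji or other characters outside CP932 lose them in the CSV. If that matters, writing the file as UTF-8 would be an alternative.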
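Summary
I changed the code from the beautifulsoup4 version to one that gets the information through the RSS feed.
To be honest, I don't fully understand why going through the RSS feed is preferable. Is it that scraping is bad because excessive access slows the site down? (I was only planning to run the previous code about once a day, so I didn't think the load on the site would be significant...)
In any case, if there is a better method, it is worth learning.
I'd like to keep working on automated information gathering, as long as it doesn't cause trouble for others.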
Reference links
【データサイエンスの基礎】pythonでRSSからデータ収集
This is the article I was pointed to as a reference.
【Pythonで形態素解析】RSSから記事タイトルを取得して形態素解析をしてみた
Next I plan to do morphological analysis with janome, so I'm noting this one for reference.
はてなブックマークのRSSフィードのURLと確認方法 - Sprint Life
This was helpful for finding Hatena's RSS feeds.
GitHub - kurtmckee/feedparser: Parse feeds in Python
Documentation — feedparser 5.2.0 documentation
Links to the feedparser documentation and its GitHub repository.