About two years ago I introduced "anemone", a crawler written in Ruby. It was already quite polished back then, and it has been my crawler of choice in Ruby ever since. I recently looked around for something newer and better, but found nothing that surpasses anemone in breadth of features. So I went back and read anemone's source, and it turns out to be nicely done: it implements the features a crawler needs, and no more than that. This isn't exactly winter-break homework, but partly as a study exercise I'm going to walk through the source.
Libraries used by Anemone
The libraries anemone uses fall into four categories.
- Ruby standard or general-purpose libraries
- Libraries used for fetching data
- Libraries used for parsing data
- Libraries used for storing data
The structure is easier to follow when viewed category by category, so I'll go through them in order.
Ruby standard or general-purpose libraries
require 'rubygems'
require 'delegate'
require 'forwardable'
require 'optparse'
require 'thread'
delegate and forwardable are Ruby standard libraries for method delegation. optparse handles command-line options, and thread is for concurrent programming. There is nothing especially noteworthy here. The one thing that caught my attention is that delegate and forwardable are both used. In anemone, only the cookie_store implementation uses delegate, while the storage backends in the data-storage part (kyoto_cabinet, pstore, tokyo_cabinet) use forwardable. I don't really understand what drove the choice between the two. The storage feature was implemented about a year later, so the author's style may simply have changed. Alternatively, since the two differ in whether the delegated methods are listed explicitly, the choice may have depended on how many methods needed to be delegated.
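To make the difference between the two libraries concrete, here is a minimal sketch. The class names CookieJar and KeyIndex are hypothetical and are not taken from anemone's source; the sketch only illustrates that DelegateClass forwards everything, while Forwardable forwards only what is listed.

```ruby
require 'delegate'
require 'forwardable'

# Hypothetical class: DelegateClass(Hash) forwards every public Hash method
# to the wrapped object, so nothing has to be listed by hand.
class CookieJar < DelegateClass(Hash)
  def initialize(cookies = {})
    super(cookies.dup)
  end
end

# Hypothetical class: Forwardable exposes only the methods named explicitly.
class KeyIndex
  extend Forwardable
  def_delegators :@keys, :has_key?, :keys, :size

  def initialize
    @keys = {}
  end

  def add(key)
    @keys[key] = true
    self
  end
end

jar = CookieJar.new('session' => 'abc')
jar['lang'] = 'ja'
puts jar.keys.inspect           # any Hash method is available

index = KeyIndex.new.add('http://www.example.com/')
puts index.size                 # delegated explicitly
puts index.respond_to?(:merge)  # not listed, so not exposed
```

If a wrapper should behave like the wrapped object in almost every respect, the delegate style is less typing; if only a handful of methods should leak through, the forwardable style keeps the public surface small.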
Structure of the data-fetching feature
require 'net/https'
require 'webrick/cookie'
require 'robotex'
Data fetching is implemented in core.rb and http.rb. For the communication itself, anemone uses the standard net/https library. Cookies are handled with webrick/cookie. As the name suggests, it borrows its cookie handling from WEBrick, the web server framework. And then there is robotex. This library was written by Chris Kite, the author of anemone, and it factors the robots.txt check out into a separate module. It can also be used when you write a crawler of your own. In anemone it is used as follows.
By default, anemone is configured not to obey robots.txt.
# don't obey the robots exclusion protocol
:obey_robots_txt => false,
When the option to obey robots.txt is passed in, the instance variable @robots is created.
@robots = Robotex.new(@opts[:user_agent]) if @opts[:obey_robots_txt]
The Robotex module is used as follows. When robots.txt is to be obeyed, anemone's allowed method calls Robotex's allowed? method to check whether the link target may be fetched. (Checking the :obey_robots_txt option a second time here seems a little questionable to me, though.)
def allowed(link)
  @opts[:obey_robots_txt] ? @robots.allowed?(link) : true
rescue
  false
end
This is where the allowed method is actually used. The visit_link? method ANDs it with the other conditions to decide whether a link may be visited.
def visit_link?(link, from_page = nil)
  !@pages.has_page?(link) &&
  !skip_link?(link) &&
  !skip_query_string?(link) &&
  allowed(link) &&
  !too_deep?(from_page)
end
With this implementation, it looks as though robots.txt might be checked on every call, even for the same site. To be sure, I checked the Robotex implementation as well. The conclusion: once a site has been checked, its robots.txt is not fetched again. That's a relief.
def allowed?(uri, user_agent)
  return true unless @parsed
  # ... (omitted) ...
end
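The "fetch robots.txt once per site" behavior can be sketched as follows. This is not Robotex's actual code: the class name RobotsCache, the injected fetcher, and the deliberately simplified parser (it only honors Disallow lines under "User-agent: *") are all assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical sketch, not Robotex's real code: caches parsed rules per host.
class RobotsCache
  # fetcher: callable that takes a robots.txt URL and returns its body (or nil).
  def initialize(fetcher)
    @fetcher = fetcher
    @rules = {} # "host:port" => array of disallowed path prefixes
  end

  def allowed?(url)
    uri = URI.parse(url.to_s)
    key = "#{uri.host}:#{uri.port}"
    # Fetch and parse robots.txt only the first time a host is seen.
    @rules[key] ||= parse(@fetcher.call("#{uri.scheme}://#{key}/robots.txt"))
    path = uri.path.empty? ? '/' : uri.path
    @rules[key].none? { |prefix| path.start_with?(prefix) }
  end

  private

  # Simplified on purpose: collects Disallow lines from "User-agent: *" groups only.
  def parse(body)
    disallowed = []
    applies = false
    body.to_s.each_line do |line|
      field, value = line.sub(/#.*/, '').strip.split(/\s*:\s*/, 2)
      next if field.nil? || value.nil?
      case field.downcase
      when 'user-agent' then applies = (value == '*')
      when 'disallow'   then disallowed << value if applies && !value.empty?
      end
    end
    disallowed
  end
end

# Stand-in for an HTTP request so the sketch runs without network access.
fetches = 0
fetcher = lambda do |_robots_url|
  fetches += 1
  "User-agent: *\nDisallow: /private/\n"
end

robots = RobotsCache.new(fetcher)
puts robots.allowed?('http://www.example.com/index.html')
puts robots.allowed?('http://www.example.com/private/a.html')
puts fetches
```

Because the parsed rules are stored per host, calling the check for every link on the same site costs one hash lookup after the first request.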
Structure of the data-parsing feature
require 'nokogiri'
require 'ostruct'
Data parsing is implemented in page.rb. Most of the actual work is parsing with nokogiri. As a result, code that uses anemone is free to process pages with nokogiri however it likes.
Usage example:
Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    title = page.doc.xpath("//head/title/text()").first.to_s if page.doc
    puts title
  end
end
As for finding links during HTML parsing, it appears to look only at the href attribute of a tags. Destinations specified through forms, JavaScript, and so on are not picked up. To avoid causing trouble for the sites being crawled, I think following only a tags is a wise choice.
def links
  return @links unless @links.nil?
  @links = []
  return @links if !doc
  doc.search("//a[@href]").each do |a|
    u = a['href']
    next if u.nil? or u.empty?
    abs = to_absolute(u) rescue next
    @links << abs if in_domain?(abs)
  end
  @links.uniq!
  @links
end
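The resolve-and-filter step in the links method can be tried in isolation with only the standard library. The helper below, same_host_links, is a hypothetical name, and it only approximates what to_absolute and in_domain? do; in particular, dropping the fragment is an assumption of this sketch.

```ruby
require 'uri'

# Hypothetical helper: resolves raw href values against the page URL and keeps
# only links on the same host, roughly what to_absolute and in_domain? do.
def same_host_links(page_url, hrefs)
  base = URI.parse(page_url)
  links = []
  hrefs.each do |href|
    next if href.nil? || href.empty?
    begin
      abs = URI.join(page_url, href)
    rescue URI::Error
      next # skip hrefs that cannot be parsed as URIs
    end
    abs.fragment = nil # assumption: "#section" anchors point at the same page
    links << abs if abs.host == base.host
  end
  links.uniq
end

hrefs = ['/about', 'news/today.html', '#top', '', 'http://other.example.org/']
same_host_links('http://www.example.com/blog/', hrefs).each { |link| puts link }
```

Feeding it the href values collected from a tags yields the list of same-site URLs a crawler would queue next.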
Structure of the data-storage feature
require 'kyotocabinet'
require 'mongo'
require 'tokyocabinet'
require 'pstore'
require 'redis'
require 'sqlite3'
anemone offers a wide choice of destinations for the data it fetches. Initially it supported only an RDBMS such as sqlite3 and standard file-backed object stores such as pstore. Now it also supports key-value stores such as tokyocabinet/kyotocabinet and redis, and NoSQL databases such as MongoDB. Judging from the commit history, these came in as pull requests from users that were then merged.
Structurally, each storage backend is added as its own implementation under anemone/storage. Adding one looks fairly easy, so I'm thinking of trying to build one that uses Amazon S3.
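As a rough idea of what a new backend involves, here is a minimal sketch of a hash-like storage class. It uses an in-memory Hash rather than Amazon S3, the class name MemoryStorage is hypothetical, and the method set ([], []=, delete, each, merge!, close, plus has_key?, keys, size) is an assumption about what anemone expects from a backend; check the existing classes under anemone/storage before relying on it.

```ruby
require 'forwardable'

# Hypothetical backend: keeps serialized values in a Hash.
# The required method set is assumed, not confirmed against anemone's source.
class MemoryStorage
  extend Forwardable
  def_delegators :@store, :has_key?, :keys, :size

  def initialize
    @store = {}
  end

  # Values are serialized on write and restored on read, as a real
  # file or network backend would have to do.
  def [](key)
    data = @store[key.to_s]
    data && Marshal.load(data)
  end

  def []=(key, value)
    @store[key.to_s] = Marshal.dump(value)
  end

  def delete(key)
    data = @store.delete(key.to_s)
    data && Marshal.load(data)
  end

  def each
    @store.each_key { |key| yield key, self[key] }
  end

  def merge!(other)
    other.each { |key, value| self[key] = value }
    self
  end

  # Nothing to release for an in-memory store.
  def close; end
end

storage = MemoryStorage.new
storage['http://www.example.com/'] = { code: 200, body: '<html></html>' }
puts storage.has_key?('http://www.example.com/')
puts storage['http://www.example.com/'][:code]
```

An S3-backed version would keep the same shape and replace the Hash reads and writes with object GET and PUT calls.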
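Summary
That was a quick look at anemone's structure. The page-fetching part appears to be tightly coupled to core. Page parsing and data storage, by contrast, are relatively loosely coupled. I originally looked into this because I wanted to know whether the page-fetching part alone could be swapped out. It doesn't look impossible if you replace the places that call http.rb. However, porting the page-parsing and data-storage features elsewhere looks like the easier route.
anemone is a relatively small library, but it has a lot going on. Reading the source in order taught me quite a bit. If you're interested, why not read through it once when you have the time? enjoy!!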
PR
I wrote a book on crawler development with Ruby, which includes an explanation of anemone.
It covers everything from crawler concepts to the actual procedures for building and operating one.
See Also:
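Trying out "Anemone", an open-source web crawler for Ruby
Trying Masque, a Ruby crawler that also handles JavaScript
Tried "cosmicrawler", a Ruby crawler that can run multiple crawls in parallel
References:
chriskite/anemone · GitHub
Rubyist Magazine - Introducing the bundled standard libraries, Part 6: Delegation
anemone RDoc
Sharing know-how on crawling and scraping with Python, Scrapy, and the like! - orangain flavor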
Rubyによるクローラー開発技法 巡回・解析機能の実装と21の運用例 (Crawler Development Techniques with Ruby: Implementing Crawling and Parsing, with 21 Operational Examples)
- Authors: るびきち, 佐々木拓郎
- Publisher: SBクリエイティブ
- Release date: 2014/08/25
- Format: large-format book