Nokogiri is the standard tool for parsing HTML and XML in Ruby. It is an indispensable module for scraping. Because it can do so many things, though, it is not always obvious where to start. Partly as self-study, this post gives an overview of Nokogiri and introduces its main features.
What is Nokogiri?
According to the README, Nokogiri is "an HTML, XML, SAX, XSLT and Reader parser", and its distinguishing feature is the ability to search documents via XPath and CSS3 selectors. It also provides builders for HTML and XML, but it is enough to remember it as an HTML and XML parser.
Nokogiri's class structure
Nokogiri is a fairly large library. It consists of more than 10 modules and more than 70 classes, and a class diagram generated with yard comes out as a sprawling picture.
Honestly, it is hard to know where to start looking. Before using Nokogiri, keep just the following points in mind. They should make the rest easier to absorb.
Get a grip on three classes
To understand Nokogiri, the important thing is to learn the methods and behavior of Nokogiri::XML::Node. As the class diagram shows, many objects inherit from Nokogiri::XML::Node. And, as the relationship Nokogiri::HTML::Document < Nokogiri::XML::Document suggests, many classes in the HTML module inherit from their counterparts in the XML module.
Nokogiri::HTML::Document < Nokogiri::XML::Document < Nokogiri::XML::Node
When some behavior is unclear, read the source of Nokogiri::XML::Node carefully. The other two classes are Nokogiri::XML::Document and Nokogiri::HTML::Document. Once you have a grip on these three, you can grasp most of what Nokogiri can do.
http://nokogiri.org/Nokogiri/XML/Node.html
http://nokogiri.org/Nokogiri/XML/Document.html
http://nokogiri.org/Nokogiri/HTML/Document.html
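Parsing HTML with Nokogiri
The most basic way to parse HTML is shown below. open-uri is part of Ruby's standard library. In the Ruby versions this article was written for, it redefines Kernel#open so that http/ftp resources can be read with the same operations as a file; since Ruby 3.0 the same functionality is called as URI.open. It is convenient to use together with Nokogiri.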
require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where plain open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open('http://www.yahoo.co.jp'))
Calling Nokogiri::HTML or Nokogiri::HTML.parse runs Nokogiri::HTML::Document.parse internally and returns a Nokogiri::HTML::Document. The following line is equivalent to the code above.
doc = Nokogiri::HTML.parse(URI.open('http://www.yahoo.co.jp'))
To extract specific tags from the resulting Nokogiri::HTML::Document, do the following.
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.yahoo.co.jp'))
title = doc.xpath('/html/head/title')   # absolute path to the title tag
objects = doc.xpath('//a')              # every a tag in the document
Node and NodeSet
An XPath or CSS search returns a Nokogiri::XML::NodeSet, which is a list of Nokogiri::XML::Node objects. NodeSet#inner_text, one of the NodeSet methods, returns the inner_text of every Node in the list, joined together. text is an alias of inner_text. For a tag that appears only once in a document, such as the HTML title tag, all of the calls below return the same result. HTML::Document also has a dedicated method for pulling out the title tag.
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.yahoo.co.jp'))
puts doc.title                   # => Yahoo! JAPAN

nodesets = doc.xpath('//title')
puts nodesets.text               # => Yahoo! JAPAN
puts nodesets.inner_text         # => Yahoo! JAPAN
puts nodesets.first.inner_text   # => Yahoo! JAPAN
nodesets.each do |node|
  puts node.content              # => Yahoo! JAPAN
  puts node.text                 # => Yahoo! JAPAN
  puts node.inner_text           # => Yahoo! JAPAN
end
When several elements match, the results differ.
nodesets = doc.xpath('//a')
puts nodesets.inner_text     # text of every link, joined into one string
nodesets.each do |node|
  puts node.inner_text       # text of a single link per iteration
end
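Search methods of Node and NodeSet
Node and NodeSet offer many ways to search, and the same search methods are available on both. The examples below all print the same result, although some return a single node and others return a NodeSet.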
puts doc % '//title'               # alias of at
puts doc / '//title'               # alias of search
puts doc.at('//title')             # returns the first node that matches the search
puts doc.at_xpath('//title')       # returns the first node that matches the XPath
puts doc.at_css('title')           # returns the first node that matches the CSS selector
puts doc.css('title')              # CSS search, returns a NodeSet
puts doc.css('title')[0]           # CSS search, first node of the NodeSet
puts doc.search('title')           # XPath or CSS search, returns a NodeSet
puts doc.search('title')[0]        # XPath or CSS search, first node of the NodeSet
puts doc.xpath('//title')          # XPath search, returns a NodeSet
puts doc.xpath('//title')[0]       # XPath search, first node of the NodeSet
puts doc.xpath('//title').first    # XPath search, first node of the NodeSet
Accessor methods of Node and NodeSet
Node and NodeSet support almost the same methods for reading their contents, although a few are not available on NodeSet. Many of the methods that return identical results are simply aliases of one another.
# Reading from a Node
# Including the HTML tag itself
puts doc.at('//title').to_html
puts doc.at('//title').to_xhtml
puts doc.at('//title').to_xml
puts doc.at('//title').to_s
# The string enclosed by the HTML tag
puts doc.at('//title').text
puts doc.at('//title').inner_html
puts doc.at('//title').inner_text
puts doc.at('//title').to_str
# Getting an attribute value
puts doc.at('//a')['href']
puts doc.at('//a').attribute('href')
puts doc.at('//a').get_attribute('href')

# Reading from a NodeSet
# Including the HTML tags themselves
puts doc.xpath('//title').to_html
puts doc.xpath('//title').to_xhtml
puts doc.xpath('//title').to_xml
puts doc.xpath('//title').to_s
# The string enclosed by the HTML tags
puts doc.xpath('//title').text
puts doc.xpath('//title').inner_html
puts doc.xpath('//title').inner_text
# puts doc.xpath('//title').to_str   # not available on NodeSet
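Assorted search techniques
These are XPath techniques more than Nokogiri techniques. Searching by attributes such as id or class is the most common case, but in fact any attribute can be used. An attribute condition is written inside [], and the part after @ is the attribute name.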
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.hatena.ne.jp/'))

# Search for h2 tags by class
puts doc.xpath("//h2[@class='title']")
# Search for a div tag by id
puts doc.xpath("//div[@id='copyright']")
# Search for div tags by a custom attribute
puts doc.xpath("//div[@data-component-term='tweet']")
# Search all tags by id
puts doc.xpath("//*[@id='copyright']")
# Narrow the search down to descendants
puts doc.xpath("//div[@id='copyright']//ul")
# Extract the URL of every link
doc.xpath('//a').each do |item|
  puts item[:href]
end
NodeSet or Element?
When you try to read an attribute or similar from a search result, you may run into an error like this:
`[]': no implicit conversion of String into Integer (TypeError)
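In most cases you believed you had an Element but were actually holding a NodeSet. A common mistake is fetching what you expected to be a unique element, only for the result to be a NodeSet because the search matched tags inside it.
If you get stuck, check whether the object you got is a NodeSet or an Element. You can tell by calling class on it.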
tweets = doc.xpath("//div[@data-component-term='tweet']")
puts tweets.class      #=> Nokogiri::XML::NodeSet
puts tweets[0].class   #=> Nokogiri::XML::Element (NilClass when nothing matched)
Parsing XML with Nokogiri
Nokogiri can parse XML as well as HTML. XML has a simpler structure, so parsing it is generally easier than parsing HTML. There is one thing to watch out for, though: when you search with XPath and the XML uses namespaces, you must always specify them. If you do not, the search matches nothing at all.
The example below extracts items from the RSS feed of Hatena's hot entries. The feed is RSS 1.0, so it uses namespaces.
require 'nokogiri'
require 'open-uri'

url = 'http://feeds.feedburner.com/hatena/b/hotentry'
xml = URI.open(url).read
doc = Nokogiri::XML(xml)

namespaces = {
  "rss" => "http://purl.org/rss/1.0/",   # default namespace
  "rdf" => "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  "content" => "http://purl.org/rss/1.0/modules/content/",
  "dc" => "http://purl.org/dc/elements/1.1/",
  "feedburner" => "http://rssnamespace.org/feedburner/ext/1.0"
}

# channel
channel = doc.xpath('//rss:channel', namespaces)
# Search for title with XPath
puts channel.xpath('rss:title', namespaces)
puts channel.xpath('feedburner:info', namespaces)

lis = channel.xpath('//rdf:li', namespaces)
lis.each { |li|
  puts li.attribute("resource")
}
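Nokogiri and Hpricot
hpricot used to be another option for an HTML/XML parser written in Ruby. However, it has since been officially declared that "Hpricot is over." and users are told to use Nokogiri instead. Nokogiri and hpricot have a lot in common, including much of the API, but there is one big difference. Nokogiri does its internal processing in UTF-8, whereas hpricot handles the given character encoding as-is and performs no conversion. For that reason, some people apparently still use hpricot to avoid encoding-related trouble.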
XPath or CSS3 selectors: which should you use?
Nodes in Nokogiri are mainly located with either XPath or CSS3 selectors. You do not need to learn both; knowing one of them is enough. If you are more comfortable with CSS, use CSS3 selectors. If you are comfortable with neither, XPath is probably the better choice because there are comparatively more samples for it. Many people from a design background will be stronger in CSS, and in that case CSS is the way to go.
How Nokogiri's parsing is implemented
Nokogiri's XML and HTML parsing essentially depends on LibXML2. Reading the internal source, you can see it grumbling about LibXML2 while trying to fill in the features that are missing. Put bluntly, Nokogiri is a Ruby wrapper module around LibXML2. With that in mind, it becomes clear that HTML and XML are parsed with the same approach.
Aliases of Nokogiri's methods
One reason Nokogiri is a little hard to approach is that the same operation can be written in many different ways. The samples you find are so varied that newcomers are thrown off. One cause is the sheer number of method aliases. As a rule, just use whichever names you are most comfortable with.
$ grep -R alias *
css/parser_extras.rb: alias :cache_on? :cache_on
css/parser_extras.rb: alias :set_cache :cache_on=
css/tokenizer.rb: alias :scan :scan_str
xml/attr.rb: alias :value :content
xml/attr.rb: alias :to_s :content
xml/attr.rb: alias :content= :value=
xml/document.rb: alias :to_xml :serialize
xml/document.rb: alias :clone :dup
xml/document.rb: alias :<< :add_child
xml/document_fragment.rb: alias :serialize :to_s
xml/node/save_options.rb: alias :to_i :options
xml/node.rb: alias :/ :search
xml/node.rb: alias :% :at
xml/node.rb: alias :next :next_sibling
xml/node.rb: alias :previous :previous_sibling
xml/node.rb: alias :next= :add_next_sibling
xml/node.rb: alias :previous= :add_previous_sibling
xml/node.rb: alias :remove :unlink
xml/node.rb: alias :get_attribute :[]
xml/node.rb: alias :attr :[]
xml/node.rb: alias :set_attribute :[]=
xml/node.rb: alias :text :content
xml/node.rb: alias :inner_text :content
xml/node.rb: alias :has_attribute? :key?
xml/node.rb: alias :name :node_name
xml/node.rb: alias :name= :node_name=
xml/node.rb: alias :type :node_type
xml/node.rb: alias :to_str :text
xml/node.rb: alias :clone :dup
xml/node.rb: alias :elements :element_children
xml/node.rb: alias :delete :remove_attribute
xml/node.rb: alias :elem? :element?
xml/node.rb: alias :add_namespace :add_namespace_definition
xml/node_set.rb: alias :<< :push
xml/node_set.rb: alias :remove :unlink
xml/node_set.rb: alias :/ :search
xml/node_set.rb: alias :% :at
xml/node_set.rb: alias :set :attr
xml/node_set.rb: alias :attribute :attr
xml/node_set.rb: alias :text :inner_text
xml/node_set.rb: alias :size :length
xml/node_set.rb: alias :to_ary :to_a
xml/node_set.rb: alias :+ :|
xml/parse_options.rb: alias :to_i :options
xml/reader.rb: alias :self_closing? :empty_element?
xml/sax/push_parser.rb: alias :<< :write
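Summary
This post did not come together as neatly as I had hoped, but I feel I understand how to use Nokogiri better than before. For Ruby users it ranks high among the libraries worth mastering. Spending a little time reading the documentation and the source improves your understanding dramatically, so do give it a try!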
Addendum 1
So how do you actually work out the XPath for a node? I wrote a separate post about that.
FireFoxやChromeを使って任意のノードのXPathを簡単に抽出する方法について - プログラマになりたい
Addendum 2
I have published a book that pulls these topics together, titled "Rubyによるクローラー開発技法".
See Also:
Ruby製のクローラー Anemoneの文字化け対策
あらためてRuby製のクローラー、"anemone"を調べてみた
オープンソースのRubyのWebクローラー"Anemone"を使ってみる
JavaScriptにも対応出来るruby製のクローラー、Masqueを試してみる
複数並行可能なRubyのクローラー、「cosmicrawler」を試してみた
takuros/anemone · GitHub
References:
sparklemotion/nokogiri · GitHub
Tutorials - Nokogiri 鋸

Rubyによるクローラー開発技法 巡回・解析機能の実装と21の運用例
- Authors: るびきち, 佐々木拓郎
- Publisher: SBクリエイティブ
- Release date: 2014/08/25
- Format: large-format book