Extracting text from Hpricot
Hpricot feels easier to use than scrAPI, but innerText does not decode HTML entities properly, so I added a different method.
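require "rubygems"
require 'cgi'       # CGI.unescapeHTML
require 'hpricot'

class Hpricot::Elem
  # Attribute lookup that decodes HTML entities in the value.
  def [](name)
    value = get_attribute(name)
    value && CGI.unescapeHTML(value)   # nil when the attribute is missing
  end

  # Like inner_text, but decodes entities and collapses whitespace.
  def to_text
    r = []
    traverse_text { |text|
      case text
      when Hpricot::CData
        r << text.content   # CDATA is kept verbatim
      else
        r << CGI.unescapeHTML(text.inner_text.gsub("\n", " ").gsub(/ +/, " ").strip)
      end
    }
    r.join
  end
end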
hp = Hpricot('<html><boge href="hoge&amp;neko">test&amp; test &amp; test<![CDATA[ hoge <&> hoge ]]></boge>')

hp.root.inner_text                        # original
# => "test&amp; test &amp; test hoge <&> hoge "
hp.root.to_text
# => "test& test & test hoge <&> hoge "

hp.root.at("boge").get_attribute(:href)   # original
# => "hoge&amp;neko"
hp.root.at("boge")[:href]
# => "hoge&neko"
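To see the per-node clean-up on its own, here is a minimal sketch that uses only Ruby's standard cgi library. The helper name normalize_text is hypothetical (it is not part of Hpricot); it just repeats the steps to_text applies to each ordinary text node: turn newlines into spaces, collapse runs of spaces, strip, then decode entities.

```ruby
require 'cgi'

# Hypothetical helper, not part of Hpricot: the same clean-up that
# to_text applies to every non-CDATA text node.
def normalize_text(raw)
  collapsed = raw.gsub("\n", " ").gsub(/ +/, " ").strip
  CGI.unescapeHTML(collapsed)
end

puts normalize_text("  test&amp;\n  test   &amp; test ")
# prints: test& test & test
```

Because each node is stripped before joining, whitespace at the edges of a text node is lost; CDATA sections are the exception, since their content is appended verbatim.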
Hpricot supports CSS selectors!
It seems a lot of people don't know this, though.
require 'open-uri'   # lets open() fetch a URL

doc = Hpricot(open("http://d.hatena.ne.jp/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE"))
doc.at("span.furigana").to_text                       # => こんのあさみ
doc.at("span.title > a:first-child").to_text          # => 紺野あさ美
doc.at("ul.list-circle > li:first-child > a").to_text # => アイドル
(The sample is inspired by 川o・-・)＜2nd life - "scrAPI, a scraping toolkit for Ruby".)
Update:
Used as-is, CGI.unescapeHTML cannot properly restore input such as "& &", so see the fix at http://d.hatena.ne.jp/walf443/20070204/1170605669 !
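The exact input that trips CGI.unescapeHTML is covered in the linked post, not here. As a rough way to check what a single call does and does not decode, the sketch below uses only Ruby's standard cgi library. The sample strings are illustrative assumptions and are not taken from the linked fix.

```ruby
require 'cgi'

# Illustrative inputs only; none of these come from the linked post.
samples = [
  "a &amp; b",   # named reference
  "&#38;",       # decimal numeric reference
  "&amp;amp;",   # escaped twice
  "&nbsp;"       # named reference outside the small built-in set
]

samples.each do |s|
  puts "#{s.inspect} -> #{CGI.unescapeHTML(s).inspect}"
end
```

On the Ruby versions I know of, one call decodes only a single level, so "&amp;amp;" comes back as "&amp;", and a reference like "&nbsp;" is left untouched. Output from to_text can therefore still contain entity-looking text for inputs like these.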