nazoking's blog

Extracting text from Hpricot

Ruby

Hpricot feels easier to use than scrAPI, but its "innerText" does not decode HTML entities properly, so I added a separate method that does.

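require "rubygems"
require "cgi"      # provides CGI.unescapeHTML
require "hpricot"

class Hpricot::Elem
  # Return the attribute value with HTML entities decoded
  # (nil if the attribute is missing).
  def [](a)
    value = get_attribute(a)
    value && CGI.unescapeHTML(value)
  end

  # Collect all text under this element. Entities are decoded in ordinary
  # text nodes; CDATA content is passed through untouched.
  def to_text
    r = []
    traverse_text do |text|
      case text
      when Hpricot::CData
        r << text.content
      else
        # Collapse newlines and runs of spaces, then decode entities.
        r << CGI.unescapeHTML(text.inner_text.gsub("\n", " ").gsub(/  +/, " ").strip)
      end
    end
    r.join
  end
end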

hp = Hpricot('<html><boge href="hoge&neko">test& test &amp;  test<![CDATA[ hoge <&amp;> hoge ]]></boge>')

hp.root.inner_text # original
# => "test& test &amp;  test hoge <&amp;> hoge "
hp.root.to_text
# => "test& test & test hoge <&amp;> hoge "

hp.root.at("boge").get_attribute(:href) # original
# => "hoge&amp;neko"
hp.root.at("boge")[:href]
# => "hoge&neko"
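
The whitespace and entity handling inside to_text can be tried without Hpricot. The following is a minimal sketch using only the standard library; normalize_text is a hypothetical name introduced here for illustration, and it simply repeats the steps applied to ordinary text nodes above (replace newlines, collapse runs of spaces, strip, decode entities).

```ruby
require "cgi"

# Hypothetical helper (not part of Hpricot): the same normalization
# that to_text applies to each non-CDATA text node.
def normalize_text(str)
  collapsed = str.gsub("\n", " ").gsub(/  +/, " ").strip
  CGI.unescapeHTML(collapsed)
end

puts normalize_text("test& test &amp;  test")
puts normalize_text("a\n  b &lt;c&gt;")
```

On a current Ruby this should print "test& test & test" and "a b <c>"; very old Ruby versions may leave the first string undecoded because of the bare "&".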

Hpricot supports CSS selectors too!

A lot of people don't seem to know that.

require "open-uri" # needed so that open can fetch a URL

doc = Hpricot(open("http://d.hatena.ne.jp/keyword/%BA%B0%CC%EE%A4%A2%A4%B5%C8%FE"))
doc.at("span.furigana").to_text
# => こんのあさみ
doc.at("span.title > a:first-child").to_text
# => 紺野あさ美
doc.at("ul.list-circle > li:first-child > a").to_text
# => アイドル

(The sample is inspired by the post "ruby のスクレイピングツールキット scrAPI" ("scrAPI, a scraping toolkit for Ruby") on the blog 川o・-・)＜2nd life.)

Addendum:

Out of the box, CGI.unescapeHTML cannot correctly restore input such as "& &", so it is a good idea to apply the fix described at http://d.hatena.ne.jp/walf443/20070204/1170605669 !
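
The linked fix is not reproduced here. As an illustration of the kind of workaround that avoids the problem, here is a minimal sketch that decodes only well-formed entities and leaves a bare "&" alone. The names safe_unescape_html and ENTITY_TABLE are hypothetical, and the set of supported entities (the five basic named ones plus numeric references) is an assumption made for this sketch.

```ruby
# Hypothetical helper, not the code from the linked post.
ENTITY_TABLE = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "apos" => "'"
}.freeze

def safe_unescape_html(str)
  # Match only complete entities, so a lone "&" is never consumed.
  str.gsub(/&(amp|lt|gt|quot|apos|#\d+|#[xX][0-9a-fA-F]+);/) do
    name = $1
    if name.start_with?("#x", "#X")
      [name[2..-1].hex].pack("U")   # hexadecimal character reference
    elsif name.start_with?("#")
      [name[1..-1].to_i].pack("U")  # decimal character reference
    else
      ENTITY_TABLE[name]            # named entity
    end
  end
end

puts safe_unescape_html("& &amp;")
puts safe_unescape_html("&lt;a href=&quot;x&quot;&gt; &#65;")
```

Because each entity is matched individually, the text between a stray "&" and a later ";" is never treated as one entity name. The sketch decodes a single level only, so "&amp;amp;" becomes "&amp;".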