[Ruby] XPath and friends
So it's HTML scraping after all?
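If the Amazon ECS secret key (is that what it's called? I only half remember) stays embedded in the app, I can't release the app even once it's finished. So what do I do?
I thought about it a lot, and for now I've decided to go with the down-and-dirty approach of scrubbing the data out of the HTML.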
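For the record, these are the alternatives I considered:
- Write the part that builds the hash in Objective-C and keep the source of that part closed. The rest of the source is published, and anyone who wants to build from source is asked to create their own ECS account.
- Use someone else's API proxy (and/or run one myself).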
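I passed on the first option because anyone who reverse-engineers the binary could probably peel the key out easily. Then again, I don't really know how much of a problem a leaked key is in the first place. The open-source licensing issue I wrote about before shouldn't be a problem, I think: if it came to that, I could split this part out into a separate executable. As long as it works like bundling a proxy with the app, it's fine.
The problem with the second option is privacy. It bothers me that the contents of the shopping cart get sent to the proxy server. The only personally identifying information that gets sent is the IP address, so I don't think it's a big deal, but people who care will care. Apparently a surprising number of people say "I use Amazon for the books I'd be embarrassed to buy at a bookstore" *1.
More than whether it's actually a problem, I don't like that having to explain the caveat gives the impression that "this looks like a hassle...", so I passed on this option too.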
As for the downside of HTML scraping: the number of requests sent to the server is small *2, so apart from performance on the app side there doesn't seem to be much to worry about. And in fact, once I tried it, even that turned out not to change much.
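XPath is handy, I tell you, XPath
As it happens, I recently had to process XML at work, so I'd been thinking I really should learn how to use XPath. This was a good excuse to practice.
I already knew roughly how XPath works, so looking up the finer points in the following document was all I needed.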
http://dret.net/lectures/xml-fall06/xpath-chapter.pdf
Extracting the items
There was already code that pulled the links to each item out of the shopping cart page, but it had turned into a mess.
First, the original approach.
Here's how I was doing it...
The ID of an item in the cart shows up in a link like this:
<a href="http://www.amazon.co.jp/exec/obidos/ASIN/4877832068/blah-blah">Some title</a>
The 4877832068 right after ASIN is the item's ID.
So the plan was to pull every link off the page and keep only the ones whose URL matches the pattern above.
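As a minimal sketch of that filtering step, here is how the pattern match alone might look in plain Ruby. The `extract_asins` helper and the sample hrefs are made up for illustration; only the `/ASIN/<id>/` pattern comes from the cart links described above.

```ruby
# Return the ASINs found in a list of href strings.
# URLs that do not contain /ASIN/<id>/ are skipped.
def extract_asins(hrefs)
  hrefs.map { |href| href[%r{/ASIN/([^/]+)/}, 1] }.compact
end

# Hypothetical hrefs standing in for the links scraped from a cart page.
sample_hrefs = [
  "http://www.amazon.co.jp/exec/obidos/ASIN/4877832068/blah-blah",
  "http://www.amazon.co.jp/gp/help/customer/display.html",
  "http://www.amazon.co.jp/exec/obidos/ASIN/4798023809/ref=xyz"
]

asins = extract_asins(sample_hrefs)
puts asins.inspect
```

Links that don't match the pattern simply drop out, which is all the original plan relied on.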
However, the cart page contains plenty of item links besides the ones for the items in the cart (recently viewed products, for example). I was worried about whether I could really tell them apart. At a quick glance it looked fine, but...
Also, the first page of the cart has links both for the items in the cart (Active from here on) and for the items moved to "今は買わない", i.e. saved for later (Saved from here on). I wanted a stable way to tell these apart.
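What caught my eye was this:
<a name="1" />
This anchor, which carries a name and isn't really a link, is always placed right before an Active item (the number changes: 1 for the first item, 2 for the second, and so on). Saved items likewise have an anchor with a name like this:
<a name="s1" />
So after calling Nokogiri#search("a"), I took the following approach:
- First, loop until an a tag with one of the target names turns up
- Once a name is found, treat the next link as the link of an Active or Saved item, depending on the kind of name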
def parse
  next_item = nil
  @page.parser.search("a").each do |a|
    name = a.get_attribute("name") || ""
    url = a.get_attribute("href") || ""
    # A bare named anchor announces what kind of item the next link belongs to.
    case name
    when /^\d+$/
      next_item = :active
      next
    when /^s\d+$/
      next_item = :saved
      next
    end
    # The first link after a marker is taken to be the item link.
    unless next_item.nil?
      url =~ %r{/ASIN/([^/]+)/}
      if next_item == :active
        @active_items.push AmazonOrganizer::Item.new($1)
      else
        @saved_items.push AmazonOrganizer::Item.new($1)
      end
      next_item = nil
    end
  end
end
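Hmm. Well, however you write it, I suppose a certain amount of sloppiness can't be helped.
I'm used to state machines from my day job, so code like this doesn't bother me much...
Problems
The problem with the approach above is that getting the price information is hard.
The price looks like this:
<b class="price">￥ 3,360</b><br />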
That's how it's embedded. Unusually, a class is set on it, so picking it up is easy. The problem is how to find where it sits relative to the item.
The item detection above is nothing more than a loop that looks for a tags, so starting from the position where a link was found, it's hard to find "the next b tag whose class is price".
XPath makes it easy
I didn't mention it above, but the cart is laid out as a table, and each item is contained in a tr tag. Something like this:
<tr>
  <td>
    <a name="1"/>
    <input name="saveForLater.1" alt="今は買わない" />
  </td>
  <td>
    <a href="http://www.amazon.co.jp/exec/obidos/ASIN/4877832068/blah-blah">Some title</a>
  </td>
  <td>
    <b class="price">￥ 3,360</b><br />
  </td>
</tr>
I've trimmed it heavily for readability, but this is the structure.
So if I can do the following, I should be able to extract the link, the title, and the price:
- Process the tr nodes that hold the items one by one
- Pull out the tags hanging under each tr
Isn't that exactly the kind of thing XPath is good at? *3
The named a tags mentioned earlier, however, have names too short to be convenient, so I use the input tags as the landmark for finding each tr. The targets are inputs whose name starts with saveForLater. or moveToCart.s, and the tr in question is found as the parent of the parent of that input.
Here's the actual code:
require "iconv" # part of the standard library in Ruby 1.8/1.9; removed in Ruby 2.0

def parse_item(key)
  # Each matching input sits two levels below the tr that holds one item.
  item_rows = @page.parser.xpath("//input[starts-with(@name, '#{key}')]/../..")
  item_rows.collect do |row|
    link_to_item = row.xpath("descendant::a[starts-with(@href, 'http://www.amazon.co.jp/exec/obidos/ASIN')]")
    url = link_to_item.xpath("@href").to_s
    asin = %r{/ASIN/([^/]+)/}.match(url)[1]
    title = Iconv.conv("UTF-8", "SJIS", link_to_item.xpath("text()").to_s)
    price_string = row.xpath("descendant::b[@class='price']/text()").to_s
    # delete removes every comma; sub would strip only the first one.
    price = /\d+/.match(price_string.delete(","))[0].to_i
    AmazonOrganizer::Item.new(asin, {:title => title, :price => price})
  end
end

def parse
  @active_items += parse_item("saveForLater.")
  @saved_items += parse_item("moveToCart.s")
end
First, the following part is where the list of tr tags gets pulled in. @page is an instance of WWW::Mechanize::Page, and its parser method returns the Nokogiri document.
item_rows = @page.parser.xpath("//input[starts-with(@name, '#{key}')]/../..")
The "../.." part may be a little ugly. Perhaps I should go looking for the tr along the ancestor axis instead.
Anyway, that aside, this one line is all it takes to handle
- Process the tr nodes that hold the items one by one
from the list above. What is this? It's too easy.
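To make the ancestor-axis idea concrete, here is a small sketch. It uses REXML, which ships with Ruby, instead of Nokogiri so that it runs without extra gems, and the markup plus the `item_rows` and `item_href` helpers are invented for the example. The first input is deliberately nested one level deeper than in the real cart, which is exactly the case where "../.." would stop at the td rather than the tr.

```ruby
require "rexml/document"

# Hypothetical, trimmed-down cart markup (not copied from a real page).
html = <<~HTML
  <table>
    <tr>
      <td><span><input name="saveForLater.1" /></span></td>
      <td><a href="http://www.amazon.co.jp/exec/obidos/ASIN/4877832068/x">A</a></td>
    </tr>
    <tr>
      <td><input name="moveToCart.s1" /></td>
      <td><a href="http://www.amazon.co.jp/exec/obidos/ASIN/4798023809/x">B</a></td>
    </tr>
  </table>
HTML

doc = REXML::Document.new(html)

# Find the nearest enclosing tr of every input whose name starts with key.
# ancestor::tr[1] does not depend on how deeply the input is nested.
def item_rows(doc, key)
  REXML::XPath.match(doc, "//input[starts-with(@name, '#{key}')]/ancestor::tr[1]")
end

# Read the href of the first link found below a row.
def item_href(row)
  REXML::XPath.first(row, "descendant::a").attributes["href"]
end

active_rows = item_rows(doc, "saveForLater.")
saved_rows = item_rows(doc, "moveToCart.s")
active_hrefs = active_rows.map { |row| item_href(row) }
saved_hrefs = saved_rows.map { |row| item_href(row) }
puts active_hrefs.inspect
puts saved_hrefs.inspect
```

With Nokogiri the same XPath expression should be usable as is, passed to `xpath` in place of the "../.." version.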
- Pull out the tags hanging under each tr
is just as easy with a relative XPath.
Any link inside this tr is known to be the item's link, so I can use it without worrying.
Like this:
link_to_item = row.xpath("descendant::a[starts-with(@href, 'http://www.amazon.co.jp/exec/obidos/ASIN')]")
url = link_to_item.xpath("@href").to_s
asin = %r{/ASIN/([^/]+)/}.match(url)[1]
title = Iconv.conv("UTF-8", "SJIS", link_to_item.xpath("text()").to_s)
Amazon's HTML is in SJIS, so the title has to be converted to UTF-8.
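A side note on that conversion: Iconv was removed from Ruby's standard library in Ruby 2.0, so on a current Ruby the same step would probably go through String#encode. The following is only a sketch under that assumption; `sjis_to_utf8` is a hypothetical helper, and the sample bytes are produced inside the script rather than taken from a real Amazon page.

```ruby
# "\u30BF\u30A4\u30C8\u30EB" is the katakana word for "title". It is encoded
# to Shift_JIS here only to obtain sample bytes like those in a scraped page.
utf8_original = "\u30BF\u30A4\u30C8\u30EB"
sjis_title = utf8_original.encode("Shift_JIS")

# Convert Shift_JIS text to UTF-8, replacing anything that cannot be converted
# instead of raising an exception.
def sjis_to_utf8(text)
  text.encode("UTF-8", "Shift_JIS", invalid: :replace, undef: :replace)
end

title = sjis_to_utf8(sjis_title)
puts title
```

The `invalid:` and `undef:` options are a design choice for scraping: a stray byte in one title then degrades to a replacement character rather than aborting the whole cart.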
The price is just as easy to pull out.
price_string = row.xpath("descendant::b[@class='price']/text()").to_s
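Turning that string into a number is a separate small step. Here is a sketch of it in isolation, with a hypothetical `parse_price` helper and made-up sample prices. Note that String#sub replaces only the first comma, so the sketch uses delete to cope with prices that contain more than one.

```ruby
# Turn a scraped price string such as "￥ 3,360" into an Integer.
# Returns nil when the string contains no digits at all.
def parse_price(price_string)
  digits = price_string.delete(",")[/\d+/]
  digits && digits.to_i
end

# "\uFFE5" is the full-width yen sign used in the cart markup.
small = parse_price("\uFFE5 3,360")
large = parse_price("\uFFE5 1,234,567")
missing = parse_price("")
puts [small, large, missing].inspect
```

Returning nil for a missing price keeps one odd row from raising in the middle of parsing the cart.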
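XPath is handy.
Coming to my senses, though, this is a bit of a "you're only noticing now?" entry, isn't it? How long has XPath been around, anyway?
But things like this are news to me, so I'm writing them up. No regrets.
Unit Test
My memory of this is hazy by now, but I recall a fair amount of trial and error with the unit tests around this point too.
The basic direction is to use Mocha to build a fake Mechanize, and have it return a Nokogiri instance that has parsed HTML saved beforehand.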
# Fake Mechanize
class DummyAgent
  include Mocha::API

  def initialize(testcase)
    @testcase_path = File.dirname(__FILE__) + "/" + testcase
    @page = nil
  end
  attr_reader :page

  # Read the saved html file that corresponds to url from @testcase_path
  def get_html(url)
    # ...
  end

  def get(url)
    parser = Nokogiri::HTML.parse(get_html(url), nil, "SJIS")
    @page = stub(:parser => parser,
                 :forms => make_dummy_forms(parser, url),
                 :form_with => stub_everything)
  end
end
With that in place, a test can be written like this:
class TC_AmazonTest < Test::Unit::TestCase
  def setup_amazon_stub(testcase)
    agent = DummyAgent.new(testcase)
    WWW::Mechanize.expects(:new).returns(agent)
  end

  def test_active_list
    setup_amazon_stub("testcase_dir1")
    amazon = Amazon.new()
    # This part uses Mechanize
    page = amazon.get(AmazonOrganizer::Amazon::URL_SHOPPING_CART)
    # This part uses Nokogiri (built here from the test case's HTML)
    page.parse
  end
end
The real code is a little different, but tests along these lines are working well.
Where XPath isn't even needed
As I already wrote, the item IDs were scraped from the HTML from the start, but everything else was obtained from the API using that ID.
That information was:
- Title
- Price
- Image (smallimage)
The title and the price can be taken from the cart, but the image still has to come from somewhere else.
So what to do?
Of course, I could simply open the product page and grab the URL of the image shown there.
But then I noticed an easier way.
It's the page for mobile devices.
For example http://www.amazon.co.jp/gp/aw/d.html?a=4798023809
The PC page actually shows a medium-size image rather than the small one, so I had been thinking I'd have no choice but to use that.
The mobile page, however, conveniently has a small image on it.
And so:
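require "net/http"

# Net::HTTP.start returns the value of its block, so the response is assigned
# outside the block and stays in scope for the parsing below.
response = Net::HTTP.start("www.amazon.co.jp") do |http|
  http.get("/gp/aw/d.html?a=#{@asin}")
end
page = Nokogiri::HTML.parse(response.body, nil, "SJIS")
@attr[:smallimage] = page.xpath("//img[contains(@src, '.jpg')]/@src").to_s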
Easy.
But I realized later that this really is overkill. The mobile page has only one .jpg on it, so a regular expression should be more than enough to pull it out.
/<img src="([^"]+\.jpg)">/.match(response.body)[1]
Something like this (I haven't tried it).
I've left it as XPath for now, but I'll fix it eventually.
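For what it's worth, here is a sketch of the regular-expression version that can be run offline. The `first_jpg_src` helper and the sample body are hypothetical, and the pattern is slightly looser than the one above: it assumes nothing about where src sits among the attributes, since I can't vouch for the exact markup of the mobile page.

```ruby
# Return the src of the first img tag that points at a .jpg, or nil if none.
# Other attributes may appear before src inside the tag.
def first_jpg_src(html)
  html[/<img\b[^>]*?\bsrc="([^"]+\.jpg)"/i, 1]
end

# Hypothetical stand-in for the body of a mobile product page.
sample_body = <<~HTML
  <html><body>
  <img alt="logo" src="http://example.com/images/logo.gif">
  <img src="http://example.com/images/4798023809.jpg">
  </body></html>
HTML

image_url = first_jpg_src(sample_body)
puts image_url.inspect
```

Returning nil instead of calling `[1]` on a failed match means a page without an image doesn't raise NoMethodError.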
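*1: I can't quite understand that way of thinking, though... Wouldn't you mind more having that kind of record left online with your personal information attached?
*2: Only once per item. The result is cached.
*3: At the time I thought this I had never used XPath even once, so I didn't know concretely how to do it... but it turned out to be the right call.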