Adding a cache to Web::Query
While developing a scraper, I keep tweaking the processing code and rerunning it, and every run fires more requests at the target web server, which is a bit of a nuisance for them.
So I tried adding a cache, and the easiest way turned out to be HTTP::Cache::Transparent.
There are only two steps:
- use HTTP::Cache::Transparent after Web::Query
- call HTTP::Cache::Transparent's init before actually using Web::Query
The test environment is my usual Sakura VPS.
■ Original code
Getting Web::Query working on a Sakura VPS for now
http://sakuragaoka.hatenadiary.jp/entry/2013/06/07/201740
■ Reference
http://d.ballade.jp/blog/2008/03/lwpget_1b79.html
■ Installation
$ sudo cpan HTTP::Cache::Transparent
■ Trying it out
- BasePath: directory where the cache files are created
- NoUpdate: for this long (in seconds), the web server is not queried again
- MaxAge: how long (in hours) cache files are kept
- Verbose: set to 1 to print verbose output to the screen
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use Web::Query;
use HTTP::Cache::Transparent;

binmode(STDOUT, ":utf8");

HTTP::Cache::Transparent::init({
    BasePath => '/var/www/XXXXXX/batch/httpcache/wq',
    NoUpdate => 60*60*24*7,    # sec
    MaxAge   => 24*365,        # hour
    Verbose  => 0,
});

wq('http://www.goo-net.com/catalog/')
    ->find('div.box_searchUsedCar ul.line li a')
    ->each(sub {
        $_[1]->each(sub {
            my (undef, $wq) = @_;
            my $name_j = $wq->text();
            my $name   = $wq->attr('href');
            $name =~ s|^/catalog/(\w+)/index.html$|$1|;
            print qq|$name ($name_j)\n|;
        })
    });

exit;
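The script above gives NoUpdate in seconds and MaxAge in hours, which is easy to mix up. As a minimal sketch, the two values can be derived from a number of days so the units stay visible. The constant names and the BasePath below are hypothetical and not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper constants; only the option names come from the module.
my $SECONDS_PER_DAY = 60 * 60 * 24;
my $HOURS_PER_DAY   = 24;

my %cache_options = (
    BasePath => '/tmp/httpcache-example',    # hypothetical cache directory
    NoUpdate => 7 * $SECONDS_PER_DAY,        # seconds: do not re-query for 7 days
    MaxAge   => 365 * $HOURS_PER_DAY,        # hours: keep cache files for 365 days
    Verbose  => 1,
);

printf "NoUpdate: %d seconds\n", $cache_options{NoUpdate};
printf "MaxAge:   %d hours\n",   $cache_options{MaxAge};
```

A reference to the hash can then be handed to init, as in HTTP::Cache::Transparent::init(\%cache_options).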
■ Result
The only difference is whether a cache is used, so the output is the same as that of the original code, but I was able to confirm that a cache file was created.
$ ls -l /var/www/cardata/batch/httpcache/wq
total 48
-rw-rw-r-- 1 sakuragaoka 49024 Jul 11 13:43 a09f5047f9c2d2e9959587f9ac732c17
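Besides listing the cache directory, a cache hit can probably also be checked from the response itself, since HTTP::Cache::Transparent documents an X-Cached header that it sets to 1 on responses served from the cache. The following is only a sketch under a few assumptions: it uses LWP::UserAgent directly instead of Web::Query, a throwaway temporary directory as BasePath, a 60-second NoUpdate, and a hypothetical helper named served_from_cache. The URL is taken from the command line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cache::Transparent;
use File::Temp qw(tempdir);

# Hypothetical helper: true when the response carries the X-Cached
# header that HTTP::Cache::Transparent adds to cached responses.
sub served_from_cache {
    my ($response) = @_;
    return $response->header('X-Cached') ? 1 : 0;
}

if (@ARGV) {
    my $url = $ARGV[0];

    # Throwaway cache directory (an assumption for this demo only).
    HTTP::Cache::Transparent::init({
        BasePath => tempdir(CLEANUP => 1),
        NoUpdate => 60,    # seconds
    });

    my $ua = LWP::UserAgent->new;
    for my $attempt (1, 2) {
        my $response = $ua->get($url);
        printf "fetch %d: %s, from cache: %s\n",
            $attempt,
            $response->status_line,
            served_from_cache($response) ? 'yes' : 'no';
    }
}
```

Run it with a URL as the only argument. If the cache is working, the second fetch should be the one reported as coming from the cache.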