HTML::TreeBuilder + CSS selectors work really well together
The other day, in "CSS selectors in Perl", I wrote that HTML::Selector::XPath looked really nice. What I have only now noticed is that it is not just about CSS selectors: the combination with HTML::TreeBuilder::XPath is quietly amazing!
When you call findnodes on an HTML::TreeBuilder::XPath object, you get back HTML::Element objects linked together as a tree. HTML::Element has quite a large API, and if you use it well, scraping code comes out looking very natural.
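To give a feel for what comes back from findnodes, here is a minimal sketch that pokes at a few HTML::Element methods (tag, attr, parent, content_list). The markup is made up for illustration and is not taken from any real page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup, only for illustration
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse('<div id="main"><p class="intro">Hello <b>world</b></p></div>');
$tree->eof;

# In list context findnodes returns the matching HTML::Element objects
my ($p) = $tree->findnodes('//p[@class="intro"]');

my $tag    = $p->tag;            # element name
my $class  = $p->attr('class');  # value of the class attribute
my $parent = $p->parent->tag;    # walk one step up the tree
my @kids   = $p->content_list;   # children: plain strings and elements

print "tag=$tag class=$class parent=$parent children=", scalar(@kids), "\n";

$tree->delete;  # free the tree explicitly
```

Every node in the result is an ordinary HTML::Element, so anything in that class's documentation is available from here on.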
For example, suppose you want to scrape just the body text from an arbitrary Hatena Diary page. The keyword links tend to get in the way, but if you pull out div.section with HTML::Selector::XPath and call as_text on the resulting HTML::Element, you get plain text with the keyword links and other markup ignored.
In other words,
    # $tree is an HTML::TreeBuilder::XPath that has parsed the diary's HTML
    $tree->findnodes(selector_to_xpath('div.section'))->shift->as_text;
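is all it takes. If you write this by hand, you tend to end up with messy code that rips out div.section with one regular expression and then deletes anything tag-like with another. With Selector::XPath + TreeBuilder::XPath you can do the same thing without writing a single line of regular expressions.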
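As a self-contained sketch of the one-liner above, the following parses a small inline document instead of a fetched page. The markup only imitates a diary entry (the class names follow the ones used in this post) and is an assumption, not real Hatena Diary output.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw/selector_to_xpath/;

# Made-up markup imitating a diary entry
my $html = <<'HTML';
<html><body>
<div class="section">
<h3>Title</h3>
<p>I like <a class="keyword" href="/keyword/Perl">Perl</a> a lot.</p>
</div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# In scalar context findnodes returns a node set; shift takes its first node
my $section = $tree->findnodes(selector_to_xpath('div.section'))->shift;

# as_text drops every tag and keeps only the text content
my $text = $section->as_text;
print "$text\n";

$tree->delete;  # free the tree explicitly
```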
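Just turning things into text is not all that impressive, though. What about when you want to pull the section out as HTML while you
- strip the tags from keyword links and keep only their text
- remove the line that says "Permalink ... " in blog mode
- remove the title as well
- keep everything else
all at once?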
Use look_down from HTML::Element to find the unwanted child elements hanging under the div.section element, and call delete or replace_with_content on them. This way you can tweak and reshape the elements you extracted with a CSS selector.
    $_->delete for $section->look_down(_tag => 'h3');
    $_->delete for $section->look_down(_tag => 'p', class => 'sectionfooter');
    $_->replace_with_content for $section->look_down(_tag => 'a', class => 'keyword');
and so on.
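The three look_down calls can be tried without fetching anything. This is a minimal sketch against made-up markup; the h3, p.sectionfooter and a.keyword names follow the ones used in this post, and everything else (the text, the example.com link) is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
use HTML::Selector::XPath qw/selector_to_xpath/;

# Made-up markup imitating a diary entry in blog mode
my $html = <<'HTML';
<div class="section">
<h3>Title</h3>
<p>I like <a class="keyword" href="/keyword/Perl">Perl</a>
and <a href="http://example.com/">this link</a>.</p>
<p class="sectionfooter">Permalink | Comments</p>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my $section = $tree->findnodes(selector_to_xpath('div.section'))->shift;

# Remove the title and the footer line entirely
$_->delete for $section->look_down(_tag => 'h3');
$_->delete for $section->look_down(_tag => 'p', class => 'sectionfooter');

# Unwrap keyword links: the <a> tag goes away, its text stays in place
$_->replace_with_content for $section->look_down(_tag => 'a', class => 'keyword');

my $out = $section->as_HTML;
print "$out\n";

$tree->delete;  # free the tree explicitly
```

The ordinary link survives because look_down only matches elements that satisfy every criterion it is given, here both the tag name and the class.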
It is not only that CSS selectors make the entry point to scraping easy. The elements you get back are themselves part of the data structure built by HTML::TreeBuilder, so you can do most things without regular expressions, and that is what takes all the strain out of it.
    #!/usr/local/bin/perl
    use strict;
    use warnings;

    use URI::Fetch;
    use HTML::Selector::XPath qw/selector_to_xpath/;
    use HTML::TreeBuilder::XPath;
    use HTML::Entities qw/decode_entities/;

    my $uri = shift or die "$0 <uri>";
    my $res = URI::Fetch->fetch($uri) or die URI::Fetch->errstr;

    # Take the first div.section on the page
    my $section = $res->tree->select_by_css('div.section')->shift
        or die "no div.section found in $uri";

    my $html = format_for($section)->as_XML;
    decode_entities($html);
    print $html, "\n";

    # Strip the title, the footer line, and the keyword-link tags
    sub format_for {
        my $section = shift;
        $_->delete for $section->look_down(_tag => 'h3');
        $_->delete for $section->look_down(_tag => 'p', class => 'sectionfooter');
        $_->replace_with_content for $section->look_down(_tag => 'a', class => 'keyword');
        $section;
    }

    # Parse the fetched content into an HTML::TreeBuilder::XPath tree
    sub URI::Fetch::Response::tree {
        my $res = shift;
        my $tree = HTML::TreeBuilder::XPath->new;
        $tree->parse($res->content);
        $tree->eof;
        $tree;
    }

    # Query the tree with a CSS selector instead of an XPath expression
    sub HTML::TreeBuilder::XPath::select_by_css {
        my ($tree, $css) = @_;
        $tree->findnodes(selector_to_xpath($css));
    }
HTML::TreeBuilder may well be the strongest tool around. I know, I am late to notice.
Until now I had few occasions to make active use of HTML::TreeBuilder, but now that HTML::Selector::XPath lets me query the tree with CSS selectors, the barrier to entry feels much lower, and I have a feeling I will be using it a great deal from here on.