I have released HTML::Tidy::LibXML, so here is the announcement.
- /lang/perl/HTML-Tidy-LibXML/trunk - CodeRepos::Share - Trac
- Dan Kogai / HTML-Tidy-LibXML - search.cpan.org
- http://www.dan.co.jp/~dankogai/cpan/HTML-Tidy-libXML-0.02.tar.gz
What prompted it was this:
"Handling HTML documents with XML::LibXML":
The $parser->parse_html_file() used in the previous section and the $parser->parse_file() for XML documents: surprisingly (and contrary to what the names suggest), both methods accept a URL as well as a file name. So even without the LWP modules, XML::LibXML alone can fetch XML/HTML files over the network and parse them.
However, this feature comes with a few problems.
Here they are:
- The fatal one first: when XML::LibXML fetches a URL directly, it ignores the HTTP Content-Type: header, yet it does check <meta http-equiv="Content-Type" content="text/html; charset=whatever">.
- The other one: the XML you get after reading a document with parse_html_file contains things like <br clear="">, which browsers cannot digest as-is.
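The first problem comes down to which charset declaration wins. As a minimal sketch of the "trust the HTTP header" approach, the following uses only the core Encode module. The helper names charset_from_content_type and decode_body, the UTF-8 fallback, and the sample bytes are assumptions made for illustration; they are not part of XML::LibXML or HTML::Tidy::LibXML.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value such as "text/html; charset=EUC-JP".
sub charset_from_content_type {
    my ($header) = @_;
    return unless defined $header;
    return $1 if $header =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i;
    return;
}

# Hypothetical helper: decode the body with the charset declared in the
# HTTP header, falling back to a caller-supplied default when the header
# has no charset or names an encoding that Encode does not know.
sub decode_body {
    my ( $bytes, $content_type, $fallback ) = @_;
    my $charset = charset_from_content_type($content_type);
    $charset = $fallback unless $charset && find_encoding($charset);
    return decode( $charset, $bytes );
}

# Six EUC-JP bytes (three Japanese characters), as a server might send them.
my $body = "\xC6\xFC\xCB\xDC\xB8\xEC";
my $text = decode_body( $body, 'text/html; charset=EUC-JP', 'UTF-8' );
printf "%d characters\n", length $text;
```

Once the body has been decoded this way, any charset named in a <meta> tag no longer matters, because the parser is handed characters rather than raw bytes.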
Even so, XML::LibXML is quite fast, so I figured that with a little touching up of the DOM it generates, it might serve as a replacement for HTML::Tidy....
That thought is how this module came about. Below is an excerpt from the POD.
NAME
    HTML::Tidy::libXML - Tidy HTML via XML::LibXML

VERSION
    $Id: libXML.pm,v 0.2 2009/02/21 11:47:58 dankogai Exp dankogai $

SYNOPSIS
    use HTML::Tidy::libXML;
    my $tidy  = HTML::Tidy::libXML->new();
    my $xml   = $tidy->clean($html, $encoding);    # clean enough as xml
    my $xhtml = $tidy->clean($html, $encoding, 1); # clean enough for browsers

EXPORT
    none.

Functions
  new
    Creates an object.

      my $tidy = HTML::Tidy::libXML->new();

  html2dom
      my $dom = $tidy->html2dom($string, $encoding);

    This is analogous to

      my $lx = XML::LibXML->new;
      $lx->recover_silently(1);
      my $dom = $lx->parse_html_string($string);

    with one major difference: HTML::Tidy::LibXML does not trust
    <meta http-equiv="content-type" content="text/html; charset=foo">,
    while XML::LibXML tries to use it.

    Consider this:

      my $dom = $lx->parse_html_file('http://example.com');

    This kind of works, since XML::LibXML is capable of fetching the
    document directly. But XML::LibXML does not honor the HTTP header.
    Here is the better practice:

      require LWP::UserAgent;
      require HTTP::Response::Encoding;
      my $uri = shift || die;
      my $res = LWP::UserAgent->new->get($uri);
      die $res->status_line unless $res->is_success;
      my $dom = $tidy->html2dom($res->content, $res->encoding);

  dom2xml
      my $xml = $tidy->dom2xml($dom, $level);

    Tidies $dom, which is an XML::LibXML::Document object, and returns an
    XML string. If the level is omitted, the resulting XML is good enough
    as XML -- valid but not very browser compliant (like <br clear="">,
    <a name="here" />). Set level to 1 or above for tidier,
    browser-compliant xhtml.

  html2xml
      my $xml = $tidy->html2xml($html, $encoding, $level);

    Which is shorthand for:

      my $dom = $tidy->html2dom($html, $encoding);
      my $xml = $tidy->dom2xml($dom, $level);

  clean
    An alias to "html2xml".

BENCHMARK
    This is what happened trying to tidy <http://www.perl.com/> on my
    PowerBook Pro. See t/bench.pl for details.

                       Rate   H::T  H::T::LibXML(1)  H::T::LibXML(0)
      H::T            96.2/s    --             -25%             -49%
      H::T::LibXML(1)  128/s   33%               --             -31%
      H::T::LibXML(0)  187/s   95%              46%               --

AUTHOR
    Dan Kogai, <dankogai at dan.co.jp>
The part that actually tidies the DOM looks like this. Note that at level 0 this part is skipped.
/lang/perl/HTML-Tidy-libXML/trunk/lib/HTML/Tidy/libXML.pm - CodeRepos::Share - Trac

    sub _tidy_dom {
        my $dom = shift;

        # remove empty attributes (like <br clear="">)
        for my $node ( $dom->findnodes('//*[attribute::*=""]') ) {
            for my $attr ( $node->attributes ) {
                next if $attr->getValue;
                $node->removeAttribute( $attr->getName );
            }
        }

        # handle <script>
        for my $script ( $dom->findnodes('//script') ) {
            $script->getAttribute('type')
              or $script->setAttribute( type => "text/javascript" );
            if ( $script->hasChildNodes ) {
                $script->insertBefore( $dom->createTextNode("//"),
                    $script->firstChild );
                $script->lastChild->appendData("\n//");
            }
            else {
                # <script src="..."/> => <script src="..."></script>
                $script->appendChild( $dom->createTextNode("") );
            }
        }

        # handle <style>
        for my $style ( $dom->findnodes('//style') ) {
            $style->getAttribute('type')
              or $style->setAttribute( type => "text/css" );
            if ( $style->hasChildNodes ) {
                # this one is trickier
                $style->insertBefore( $dom->createTextNode("/*"),
                    $style->firstChild );
                $style->lastChild->insertData( 0, "*/" );
                $style->lastChild->appendData("/*");
                $style->appendChild( $dom->createTextNode("*/") );
            }
            else {
                $style->appendChild( $dom->createTextNode("") );
            }
        }

        # fix <img>: give every image an alt attribute
        for my $img ( $dom->findnodes('//img') ) {
            # leave images that already carry alt text alone
            next if $img->getAttribute('alt');
            my $alt = $img->getAttribute('src') || '';
            $alt =~ s{.*/}{}o;    # basename only
            $img->setAttribute( alt => $alt || 'img' );
        }

        # <a name="foo"/> => <a name="foo"></a>
        for my $a ( $dom->findnodes('//a[@name!=""]') ) {
            my $empty = $dom->createTextNode("");
            $a->appendChild($empty);
        }
    }
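To see the first pass of the routine above in isolation, here is a small sketch that parses a fragment with XML::LibXML and strips empty attributes. It is modeled on that loop rather than copied from the module: the sample markup is made up, it tests the value with length so that a value of "0" survives, and it assumes XML::LibXML is installed.

```perl
use strict;
use warnings;
use XML::LibXML;

# Parse a small HTML document leniently, the way the POD excerpt
# describes html2dom as doing.
my $parser = XML::LibXML->new;
$parser->recover_silently(1);
my $dom = $parser->parse_html_string(
    '<html><body><p>line<br clear="">break</p></body></html>'
);

# Drop every attribute whose value is the empty string.
for my $node ( $dom->findnodes('//*[attribute::*=""]') ) {
    for my $attr ( $node->attributes ) {
        next if length $attr->getValue;
        $node->removeAttribute( $attr->nodeName );
    }
}

my ($br) = $dom->findnodes('//br');
print $br->hasAttribute('clear') ? "clear kept\n" : "clear removed\n";
```

The XPath predicate narrows the outer loop to elements that have at least one empty attribute, so the inner loop only visits attributes on those elements.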
To all CodeRepos committers: I would be grateful if you could correct whatever parts fall short.
Enjoy!
Dan the (X?HTML|Perl) Monger
���Υ֥����˥����Ȥ���ˤ�����������ɬ�פǤ���
��������������
���ε����ˤϵ��ĥ桼�����������Ȥ��Ǥ��ޤ���