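HTML that I fetch from random places on the web sometimes comes out garbled when I hand it to Nokogiri, which had been bugging me.
Nokogiri accepts a character encoding as an argument, so I pulled every charset out of the HTML with a regular expression and used the one that appears most often as the page's charset. That worked.
It is about as heuristic as it gets, but it works most of the time.
It looks like this: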
charset = io.scan(/charset="?([^\s"]*)/i).flatten.inject(Hash.new{0}){|a, b|
  a[b]+=1
  a
}.to_a.sort_by{|a| a[1] }.reverse.first[0]
before
"ã\u0082«ã\u0083¼ã\u0083\u0089ã\u0083\u0095ã\u0082ãã\u0082ãã\u0083\u0088!! ã\u0083´ã\u0082ãã\u0083³ã\u0082¬ã\u0083¼ã\u0083\u0089ã\u0080\u0080㬬6ãã±ã\u0080\u008Cã¬\u008Eã\u0081ãã\u0082«ã\u0083¼ã\u0083\u0089ã\u0082·ã\u0083§ã\u0083\u0083ã\u0083\u0097ã\u0080\u008D ã\u0080\u0090 ã\u0083\u008Bã\u0082³ã\u0083\u008Bã\u0082³ã\u008B\u0095ã\u0094»(ã\u008E\u009Fããã)"
after
"カードファイト!! ヴァンガード　第6話「謎のカードショップ」 ‐ ニコニコ動画(原宿)"
https://gist.github.com/827022
# -*- coding: utf-8 -*-
require 'open-uri'
require 'nokogiri'

def before(url)
  io = URI.parse(url).read
  Nokogiri(io)
end

def after(url)
  io = URI.parse(url).read
  charset = io.scan(/charset="?([^\s"]*)/i).flatten.inject(Hash.new{0}){|a, b|
    a[b]+=1
    a
  }.to_a.sort_by{|a| a[1] }.reverse.first[0]
  Nokogiri(io, url, charset)
end

puts 'before'
p before('http://www.nicovideo.jp/watch/1297306177').at('title').content
puts 'after'
p after('http://www.nicovideo.jp/watch/1297306177').at('title').content
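The counting step can be tried without touching the network. The following is only a sketch of the same "most frequent charset wins" idea: most_common_charset is a name made up for this example, and the slightly stricter regexp and the nil fallback are my own assumptions, not part of the gist.

```ruby
# Hypothetical helper (not in the gist): returns the charset label that
# appears most often in +html+, downcased, or nil when none is found.
def most_common_charset(html)
  # Scan a binary copy so that invalid byte sequences in a page that has
  # not been decoded yet cannot trip up the regexp.
  labels = html.b.scan(/charset\s*=\s*["']?([\w.:-]+)/i).flatten.map(&:downcase)
  return nil if labels.empty?

  counts = labels.each_with_object(Hash.new(0)) { |label, h| h[label] += 1 }
  counts.max_by { |_label, count| count }.first
end

sample = <<~HTML
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <script src="/a.js" charset="utf-8"></script>
  <script src="/b.js" charset="euc-jp"></script>
HTML

p most_common_charset(sample)            # => "utf-8"
p most_common_charset("<p>no hints</p>") # => nil
```

Returning nil instead of raising lets the caller decide what to do when a page declares no charset at all; the one-liner above would fail with NoMethodError in that case.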
Update
With this method, Hatena Group pages get detected as EUC-JP. This happens whenever a page loads JavaScript from another web service whose encoding differs from its own. Annoying.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="stylesheet" href="/diary_css/base.css" type="text/css" media="all" charset="euc-jp">
<link rel="stylesheet" href="/theme/hatena/hatena.css" type="text/css" media="all" charset="euc-jp">
<script type="text/javascript" src="http://d.hatena.ne.jp/js/quick_pager.js" charset="euc-jp"></script>
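To see the miscount concretely, the markup above can be fed to the original regexp. The second half of this sketch shows one possible way around it, which is to look only at charset declarations inside meta tags. That refinement is my own assumption and not what the gist does; meta_charset is a hypothetical name.

```ruby
# Head section quoted from the Hatena Group page above.
HATENA_HEAD = <<~HTML
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <link rel="stylesheet" href="/diary_css/base.css" type="text/css" media="all" charset="euc-jp">
  <link rel="stylesheet" href="/theme/hatena/hatena.css" type="text/css" media="all" charset="euc-jp">
  <script type="text/javascript" src="http://d.hatena.ne.jp/js/quick_pager.js" charset="euc-jp"></script>
HTML

# The original regexp counts every charset attribute, including the ones
# that describe external stylesheets and scripts.
p HATENA_HEAD.scan(/charset="?([^\s"]*)/i).flatten
# => ["utf-8", "euc-jp", "euc-jp", "euc-jp"]

# Hypothetical refinement: return the first charset declared inside a
# <meta> tag, downcased, or nil when no <meta> tag declares one.
def meta_charset(html)
  html.b.scan(/<meta\b[^>]*>/i).each do |tag|
    label = tag[/charset\s*=\s*["']?([\w.:-]+)/i, 1]
    return label.downcase if label
  end
  nil
end

p meta_charset(HATENA_HEAD) # => "utf-8"
```

This is still a regexp heuristic, so a charset mentioned inside some unrelated meta content attribute would fool it too.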
Update
How about require "open-uri" ; open(url).charset ? http://rurema.clear-code.com/1.9.2/library/open=2duri.html
Hatena Bookmark - sonota88's bookmarks - February 16, 2011
The charset reported by open-uri is right most of the time and works nicely, but for Nico Nico Douga it comes back as iso-8859-1.
So I decided that when open-uri reports iso-8859-1, I look for a charset in the HTML myself and use that instead.
Thanks, id:sonota88!
def after(url)
  io = URI.parse(url).read
  charset = io.charset
  if charset == "iso-8859-1"
    charset = io.scan(/charset="?([^\s"]*)/i).flatten.inject(Hash.new{0}){|a, b|
      a[b]+=1
      a
    }.to_a.sort_by{|a| a[1] }.reverse.first[0]
  end
  Nokogiri(io, url, charset)
end
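The decision itself can be separated from the download and checked with plain strings. As far as I can tell from the open-uri documentation, iso-8859-1 is what charset returns for text types over HTTP when the Content-Type header has no charset parameter, so it effectively means "the server did not say". A minimal sketch under that assumption follows; pick_charset is a hypothetical helper, and reported stands in for the value of io.charset.

```ruby
# Hypothetical helper (not in the gist). +reported+ is whatever open-uri's
# #charset returned; +html+ is the page body.
def pick_charset(reported, html)
  # A specific label from the HTTP header wins.
  return reported.downcase if reported && reported.downcase != "iso-8859-1"

  # Otherwise count the charset labels found in the markup.
  counts = Hash.new(0)
  html.b.scan(/charset\s*=\s*["']?([\w.:-]+)/i) do |groups|
    counts[groups.first.downcase] += 1
  end
  best, _count = counts.max_by { |_label, count| count }

  # Nothing in the markup either: keep what open-uri said (possibly nil).
  best || reported
end

p pick_charset("euc-jp", '<meta charset="utf-8">')     # => "euc-jp"
p pick_charset("iso-8859-1", '<meta charset="UTF-8">') # => "utf-8"
p pick_charset("iso-8859-1", "<p>no hints</p>")        # => "iso-8859-1"
```

In after(url) this would be used as Nokogiri(io, url, pick_charset(io.charset, io)).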
before
# "ã\u0082«ã\u0083¼ã\u0083\u0089ã\u0083\u0095ã\u0082ãã\u0082ãã\u0083\u0088!! ã\u0083´ã\u0082ãã\u0083³ã\u0082¬ã\u0083¼ã\u0083\u0089ã\u0080\u0080㬬6ãã±ã\u0080\u008Cã¬\u008Eã\u0081ãã\u0082«ã\u0083¼ã\u0083\u0089ã\u0082·ã\u0083§ã\u0083\u0083ã\u0083\u0097ã\u0080\u008D ã\u0080\u0090 ã\u0083\u008Bã\u0082³ã\u0083\u008Bã\u0082³ã\u008B\u0095ã\u0094»(ã\u008E\u009Fããã)"
after
"カードファイト!! ヴァンガード　第6話「謎のカードショップ」 ‐ ニコニコ動画(原宿)"
before
"復讐 - eigokunの手記 - - -"
after
"復讐 - eigokunの手記 - - -"