Guessing whether a Japanese string is garbled in Perl, and repairing it
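A summary of what I looked into as a countermeasure, after someone recently posted a large number of bookmarks containing double-encoded strings to Buzzurl, apparently with a homemade script or something. <Addendum> id:miyagawa's bookmark comment pointed me to a module called Encode::DoubleEncodedUTF8. When I looked it up, the author turned out to be id:miyagawa as well. Let's use that module for correcting double encoding. Still, I googled things like "二重エンコード perl utf8" ("double encode perl utf8") and could not find it... I think id:miyagawa's blog and the like ought to rank higher in search results. </Addendum>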
The principle for using UTF8 strings in Perl
If you handle UTF8 strings in Perl, you have to follow the principle below, which id:dankogai, the god of Encode, has repeated over and over until he was blue in the face. Otherwise all sorts of unpleasant things will happen to you.
Decode at the entrance, handle everything internally as flagged utf8, and encode at the exit. That is all there is to it. Stick to this basic policy and you will be happy.
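As a minimal sketch of that policy (the helper name `first_chars` and the sample bytes are my own illustration, not from id:dankogai), a function that receives UTF-8 octets and returns UTF-8 octets might look like this:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 octets, returns UTF-8 octets.
sub first_chars {
    my ($octets, $n) = @_;
    my $chars = decode('UTF-8', $octets);   # entrance: octets -> characters
    my $head  = substr($chars, 0, $n);      # inside: work on characters only
    return encode('UTF-8', $head);          # exit: characters -> octets
}

my $in  = "EC\xE3\x83\x8A\xE3\x83\x93";     # UTF-8 octets for "EC" + 2 katakana
my $out = first_chars($in, 3);              # first 3 characters, as octets
print unpack('H*', $out), "\n";             # hex dump of the result
```

Because `substr` runs on the decoded string, it cuts after the third character rather than in the middle of a 3-byte sequence.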
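But even so, there are cases where you see mojibake. For example, with a CPAN module you assume it follows the "decode at the entrance / encode at the exit" principle, pass it an already-encoded byte sequence, get mojibake, wonder why, and on investigation find that the module expects a decoded, flagged utf8 string as its argument. Well, ordinary testing catches this, so it is not much of a problem.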
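Also, I have a feeling that in the old days, calling decode_utf8 again on a flagged UTF8 string caused mojibake (I wanted to try it, but I don't have such an environment at hand right now). In newer Encode, 2.13 and later, decode_utf8 on a flagged UTF8 string does nothing.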
That being so, the mojibake patterns you actually run into are probably the following two.
- Concatenating a flagged UTF8 string with a binary string
- Double encoding
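The first pattern can be reproduced with a few lines. This is only a sketch with sample bytes I picked for illustration: when a byte string is concatenated with a flagged string, Perl upgrades the bytes as if they were Latin-1 characters, so each octet becomes a character of its own.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xE3\x83\x8A";                   # UTF-8 octets for U+30CA, not decoded
my $chars  = decode('UTF-8', "\xE3\x83\x93");  # properly decoded: one character, U+30D3

# $octets is treated as three Latin-1 characters (U+00E3, U+0083, U+008A).
my $mixed = $octets . $chars;
my $wire  = encode('UTF-8', $mixed);           # what would be written at the exit

print length($mixed), "\n";                    # 4 characters, not 2
print unpack('H*', $wire), "\n";               # c3a3c283c28ae38393
```

The first three characters come out double-encoded while the last one is fine, which is why this kind of output tends to be only partly garbled.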
Double encoding
String concatenation aside, what is double encoding? It is the defect that occurs when binary data that is already UTF8-encoded gets UTF8-encoded once more.
What does that mean? For example, the character sequence "ECナビ" is represented in UTF8 as [(0x45) (0x43) (0xE3 0x83 0x8A) (0xE3 0x83 0x93)] (note: the parentheses are only there to show the boundaries and of course do not exist in the binary representation). For one reason or another (say, a poor understanding of Encode::encode), this UTF8 binary representation can end up being UTF8-encoded yet again.
What happens then? Whatever its logical meaning, UTF8 as an encoding can physically encode any bit string of up to 21 bits (or 31 bits), so each byte gets UTF8-encoded as a 7-bit or 8-bit bit string. In UTF8, a 7-bit string is represented in 1 byte, in the range 0x00 to 0x7F. A bit string of 8 to 11 bits is represented in 2 bytes, in the range 0xC280 to 0xDFBF (a single byte 0x80 to 0xFF always lands in 0xC280 to 0xC3BF). Double-encoding the UTF8 representation of "ECナビ", [0x45 0x43 0xE3 0x83 0x8A 0xE3 0x83 0x93], in this way gives: [(0x45) (0x43) (0xc3 0xa3) (0xc2 0x83) (0xc2 0x8a) (0xc3 0xa3) (0xc2 0x83) (0xc2 0x93)]
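    use strict;
    use warnings;
    use utf8;     # this source file is written in UTF8
    use Encode;

    my $utf8str = "ECナビ";                 # flagged character string
    my $utf8bin = encode_utf8($utf8str);    # correct UTF8 bytes
    my $fuckbin = encode_utf8($utf8bin);    # encoded a second time: double-encoded
    print $utf8bin, "\n";
    print $fuckbin, "\n";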
If you are careful yourself, you can avoid producing rotten UTF8 binary like this. The problem is when rotten UTF8 binary like this gets poured in from outside. It may contain meaningless sequences, but at least physically it is in "correct UTF8 format", so decode_utf8 will not save you.
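To see that point concretely, here is a small sketch (the helper `decodes_strictly` is a name I made up) that asks Encode to croak on malformed input. The double-encoded bytes still pass, because every sequence in them is well-formed UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $good   = "EC\xE3\x83\x8A\xE3\x83\x93";   # correct UTF-8 octets
my $double = encode('UTF-8', $good);         # the octets encoded once more

# Hypothetical helper: true if the octets decode without any malformation.
sub decodes_strictly {
    my ($octets) = @_;
    my $copy = $octets;   # decode() with a CHECK value may modify its argument
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

print decodes_strictly($good),      "\n";    # 1
print decodes_strictly($double),    "\n";    # 1 as well: nothing is malformed
print decodes_strictly("EC\xFF"),   "\n";    # 0: 0xFF never appears in UTF-8
```

So a strictness check separates broken bytes from well-formed ones, but it cannot tell single encoding from double encoding.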
Detecting double encoding
Can rotten UTF8 binary like this be detected? UTF8 of this kind consists entirely of 1-byte and 2-byte sequences. Fortunately, most Japanese characters take 3 bytes in UTF8, so if you are mainly using Japanese characters, you can judge, though not perfectly, whether a string looks double-encoded.
    sub only2bytes {
        my $ascii = "[\x{00}-\x{7f}]";                  # up to 7 bits: 1 byte
        my $chr2b = "[\x{c0}-\x{df}][\x{80}-\x{bf}]";   # up to 11 bits: 2 bytes
        my $re    = qr/^($ascii|$chr2b)+$/;
        # With Encode 2.13 and later there is no risk of decode_utf8() decoding twice,
        # so to encode safely we call decode_utf8() first and then encode_utf8().
        my $bin = encode_utf8(decode_utf8(shift));
        $bin =~ /$re/;
    }

    my $utf8bin = encode_utf8("ECナビ");
    my $fuckbin = encode_utf8($utf8bin);
    warn( (only2bytes($utf8bin)) ? "fuck" : "valid" );
    warn( (only2bytes($fuckbin)) ? "fuck" : "valid" );
A Vladimir Putin detection function
Now, I said "most", but are there really no Japanese strings made up solely of 2-byte sequences? (What counts as "Japanese" in Unicode is a delicate question, but if you are building a web app or the like, a realistic line would be, for example, to test for now only the range that can be round-tripped with EUC-JP and CP932, and to budget for Arabic script support when you roll the service out to the Iranian market.)
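To give the conclusion first: yes, there are. I can't say whether the list is complete, but when I grepped a code mapping table that turned up on Google, I found that, for example, "Владимир Путин" (Vladimir Putin) consists only of 2-byte sequences in UTF8. Some symbols, Greek letters, Cyrillic letters and so on fall into this category.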
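If you would rather check individual characters than grep a mapping table, something like the following sketch works. The helper `utf8_len` and the sample code points are my own choices for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many octets one code point takes in UTF-8.
sub utf8_len {
    my ($codepoint) = @_;
    return length encode('UTF-8', chr $codepoint);
}

# U+0045 'E', U+00D7 multiplication sign, U+03A9 Greek capital omega,
# U+0412 Cyrillic capital Ve, U+30CA katakana NA
printf "U+%04X: %d\n", $_, utf8_len($_) for 0x45, 0xD7, 0x3A9, 0x412, 0x30CA;
```

This prints lengths 1, 2, 2, 2 and 3 respectively: the symbol, the Greek letter and the Cyrillic letter are the 2-byte troublemakers, and the katakana is the usual 3 bytes.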
Needless to say, failing to support strings like these could get you erased by the KGB, and would be rude to Mr. επιστημη (Episteme), so let's try to rescue these cases.
    sub is_putin {
        my $ascii = "[\x{00}-\x{7f}]";
        # 2-byte characters that do occur in Japanese text: some symbols, Greek, Cyrillic
        my $jpn2b = "[´ ¨ ± × ÷ ° ¢ £ § ¬ ¶Α-Ωα-ωА-Яа-я]";
        my $re    = qr/^($ascii|$jpn2b)+$/;
        my $str   = decode_utf8(shift);
        $str =~ /$re/;
    }

    my $putin = "Владимир Путин";
    warn( (only2bytes($putin)) ? "fuck" : "valid" );
    warn( (is_putin($putin))   ? "Ura!!" : "fuck" );
After that, just combine them
Although this only supports Japanese, once detection works the repair is easy. The data has been double-encoded, so you just decode it one more time. (Note that because decode_utf8 does nothing to a flagged UTF8 string, the second decode has to call decode instead.)
    sub smart_decode_utf8 {
        my $utf8bin = shift;
        my $tmp = decode_utf8($utf8bin);
        # decode_utf8 leaves flagged strings alone, so the second pass uses decode()
        (is_dual_encode($tmp)) ? decode("utf8", $tmp) : $tmp;
    }

    sub is_dual_encode {
        my $text = shift;
        only2bytes($text) && !is_putin($text);
    }

    my $str = "ECナビ";
    print smart_decode_utf8($str), "\n";
    my $fuck = encode_utf8(encode_utf8($str));
    print smart_decode_utf8($fuck), "\n";
    my $name = "Владимир Путин";
    print smart_decode_utf8($name), "\n";
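For comparison, here is a variant that works purely on octets and so does not depend on how decode_utf8 treats flagged strings. To be clear, this is my own sketch and not the code above: `repair_octets` is a hypothetical name, and it relies on two assumptions. First, double-encoding a byte 0x80 to 0xFF can only produce the lead bytes 0xC2 and 0xC3. Second, a real double encoding must still be well-formed UTF-8 after one layer is peeled off.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes UTF-8 octets, returns a character string,
# undoing one layer of double encoding when the heuristic finds one.
sub repair_octets {
    my ($octets) = @_;

    # Candidate only if everything is ASCII or a 2-byte sequence led by C2/C3.
    if (   $octets =~ /\A(?:[\x00-\x7F]|[\xC2\xC3][\x80-\xBF])+\z/
        && $octets =~ /[\xC2\xC3]/)
    {
        my $inner = decode('UTF-8', $octets);   # every character is <= U+00FF
        utf8::downgrade($inner);                # store them as plain bytes again
        my $fixed = eval { decode('UTF-8', $inner, Encode::FB_CROAK) };
        return $fixed if defined $fixed;        # inner layer was well-formed UTF-8
    }
    return decode('UTF-8', $octets);            # ordinary single decode
}

my $once  = "EC\xE3\x83\x8A\xE3\x83\x93";
my $twice = "EC\xC3\xA3\xC2\x83\xC2\x8A\xC3\xA3\xC2\x83\xC2\x93";
print repair_octets($once) eq repair_octets($twice) ? "same\n" : "different\n";
```

With this approach Cyrillic and Greek never become candidates, since their lead bytes are 0xCE to 0xD1, and a lone symbol such as × (0xC3 0x97) is left alone because peeling a layer off gives the invalid byte 0xD7. It is still a guess: text that really is meant to read like "Ã©" would be "repaired" into "é".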
That's all. If I got anything wrong, I'm sure id:dankogai will sort it out.
Related information on double encoding