Quick-and-dirty Japanese/Chinese discrimination
On Twitter, id:showyou mentioned wanting to "tell Japanese from Chinese in log data", so here are some notes on that.
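First, as a premise: by looking at the characters alone, you cannot separate Japanese from Simplified Chinese with 100% accuracy. (Traditional Chinese is even more trouble, but I am setting it aside for now.)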
The reason is that a sentence in Simplified Chinese does not necessarily contain any simplified characters.
“真的？” (Really?)
“恭喜恭喜！” (Congratulations!)
These are the typical examples, but in practice even a much longer sentence with no simplified characters can turn out to be Chinese.
There is also the simple rule "if it contains kana, it is Japanese", but the converse does not hold: text that is all kanji is not necessarily Chinese.
「最低！」 ("The worst!")
「関西電気保安協会」 (Kansai Electrical Safety Inspection Association), and so on.
Since character-based discrimination cannot be exact, the best option when you need good accuracy is to use a library.
Language Detection Library for Java, for example.
However, that is a Java library, so calling it from a scripting language is a hassle.
So, if you can tolerate one of the two kinds of error, "Japanese judged as Chinese" or "Chinese judged as Japanese", the job can be done easily in Perl or the like.
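First, the case where it is acceptable to mistake Japanese for Chinese.
Here, "kanji present and no kana means Chinese" is the quickest rule. (This is the ultra-lazy version. As noted above, all-kanji text is not necessarily Chinese. A slightly better version comes at the end.)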
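    use strict;
    use utf8;

    my $str = '(text to classify)';
    # Chinese if there is at least one CJK ideograph and no hiragana/katakana.
    # (A ternary is used instead of "my ... if ...", whose behavior is undefined.)
    my $is_zh = ($str =~ tr/\x{4e00}-\x{9fff}//
                 and $str !~ tr/\x{3041}-\x{3093}\x{30a1}-\x{30f6}//) ? 1 : 0;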
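To see what this rule accepts and what it gets wrong, here is a minimal sketch that wraps the same two character-range checks in a subroutine. The name `looks_chinese_without_kana` and the sample strings are my own illustration, not part of the original snippet.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: true when the text contains at least one CJK
# ideograph (U+4E00..U+9FFF) and no hiragana or katakana.
sub looks_chinese_without_kana {
    my ($text) = @_;
    my $kanji = ($text =~ tr/\x{4e00}-\x{9fff}//);
    my $kana  = ($text =~ tr/\x{3041}-\x{3093}\x{30a1}-\x{30f6}//);
    return ($kanji > 0 && $kana == 0) ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $text ('真的？', 'かなを含む文', '最低！') {
    printf "%s => %d\n", $text, looks_chinese_without_kana($text);
}
```

The Chinese phrase and the kana-bearing sentence are classified as expected, while the all-kanji Japanese 「最低！」 is reported as Chinese. That is exactly the error this variant agrees to tolerate.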
Next, the case where it is acceptable to mistake Chinese for Japanese.
In this case the rule is "if there is a simplified character that does not exist in JIS, it is Chinese".
To decide whether a character is simplified, download Unihan.zip (http://www.unicode.org/Public/UNIDATA/Unihan.zip); a character that has a kTraditionalVariant entry is a simplified character.
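However, if you use that criterion alone and say "it is a simplified character, therefore Chinese", the results are terrible.
That is because some Japanese abbreviated forms, such as 「国」 and 「体」, are identical to Chinese simplified characters.
So the key point is "a simplified character that is also not in JIS".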
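The combined rule can be sketched as follows. This is only an illustration under assumptions: the three data lines imitate the tab-separated layout of the Unihan variants file (code point, field name, value) and stand in for reading the real file, cp932 is used as the stand-in for "JIS", and the helper names `in_jis` and `has_simplified_outside_jis` are hypothetical.

```perl
use strict;
use warnings;
use utf8;
use Encode;

# Illustrative lines in the Unihan layout: code point <TAB> field <TAB> value.
# A real run would read these from the extracted Unihan data instead.
my @unihan_lines = (
    "U+56FD\tkTraditionalVariant\tU+570B",   # 国 -> 國
    "U+4F53\tkTraditionalVariant\tU+9AD4",   # 体 -> 體
    "U+8BED\tkTraditionalVariant\tU+8A9E",   # 语 -> 語
);

# Collect every character that has a kTraditionalVariant entry.
my %is_simplified;
for my $line (@unihan_lines) {
    next if $line =~ /^#/;
    my ($cp, $field) = split /\t/, $line;
    next unless defined $field && $field eq 'kTraditionalVariant';
    $cp =~ s/^U\+//;
    $is_simplified{ chr hex $cp } = 1;
}

my $cp932 = find_encoding('cp932');

# True if a single ideograph can be encoded in cp932
# (an unmappable character comes back as the substitute '?').
sub in_jis {
    my ($char) = @_;
    my $bytes = $cp932->encode($char);
    return $bytes ne '?';
}

# True if the text has a simplified character that JIS lacks.
sub has_simplified_outside_jis {
    my ($text) = @_;
    for my $char (split //, $text) {
        return 1 if $is_simplified{$char} && !in_jis($char);
    }
    return 0;
}
```

With this table, 「中国」 is not flagged even though 「国」 has a kTraditionalVariant entry, because the character also exists in JIS, whereas a string containing 「语」 is flagged.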
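That said, for this level of accuracy there is no need to consult Unihan at all. A simpler rule works just as well: "if the text can be converted to a Simplified Chinese character encoding and contains a character that cannot be converted to a Japanese one, it is Chinese".
The Simplified Chinese encoding here means specifically GB2312. (GBK, GB18030 and the like are no good: they are so large that they include the Japanese characters too.)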
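The difference between GB2312 and GBK can be checked directly. The sketch below counts how many characters of a Japanese string fail to encode under each; `count_unmappable` is a hypothetical helper, and `cp936` is used as Encode's name for Microsoft's GBK code page.

```perl
use strict;
use warnings;
use utf8;
use Encode;

# Hypothetical helper: number of characters the named encoding cannot
# represent, counted as the '?' substitutes that encoding adds.
sub count_unmappable {
    my ($encoding_name, $text) = @_;
    my $enc = find_encoding($encoding_name)
        or die "unknown encoding: $encoding_name";
    my $before = ($text =~ tr/?//);
    my $bytes  = $enc->encode($text);
    my $after  = ($bytes =~ tr/?//);
    return $after - $before;
}

my $japanese      = '関西電気保安協会';
my $gb2312_misses = count_unmappable('gb2312', $japanese);
my $gbk_misses    = count_unmappable('cp936',  $japanese);   # cp936 = GBK
print "gb2312 misses: $gb2312_misses, gbk misses: $gbk_misses\n";
```

GB2312 has no slot for Japanese-only forms such as 「関」 or 「気」, so it reports misses, while GBK encodes the whole string and therefore tells you nothing about the language.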
In Perl it looks like this.
    use strict;
    use utf8;
    use Encode;

    my $enc_gb2312 = find_encoding('gb2312');
    my $enc_cp932  = find_encoding('cp932');

    my $str = '(text to classify)';
    # Unmappable characters are encoded as '?', so compare the '?' counts.
    my $questions_before = ($str =~ tr/?//);
    my $questions_gb2312 = ($enc_gb2312->encode($str) =~ tr/?//);
    my $questions_cp932  = ($enc_cp932->encode($str) =~ tr/?//);
    # Chinese if everything fits in GB2312 but something does not fit in cp932.
    my $is_zh = ($questions_before == $questions_gb2312
                 and $questions_cp932 > $questions_before) ? 1 : 0;
Here I used cp932 as the Japanese encoding, but that has its own problems, so some extra handling around it may be needed.
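Incidentally, if you swap the roles of gb2312 and cp932 in this check, it can be used for the "acceptable to mistake Japanese for Chinese" case. There, first treat "contains kana" as definitely Japanese, then run the (swapped) check. The result is slightly better than the version above: it can identify 「関西電気保安協会」 as Japanese.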
    use strict;
    use utf8;
    use Encode;

    my $enc_gb2312 = find_encoding('gb2312');
    my $enc_cp932  = find_encoding('cp932');

    my $str = '(text to classify)';
    my $is_zh = 0;
    # Any kana means Japanese, so only kana-free text is examined further.
    if ($str !~ tr/\x{3041}-\x{3093}\x{30a1}-\x{30f6}//) {
        my $questions_before = ($str =~ tr/?//);
        my $questions_gb2312 = ($enc_gb2312->encode($str) =~ tr/?//);
        my $questions_cp932  = ($enc_cp932->encode($str) =~ tr/?//);
        # Japanese only if it fits in cp932 but not in GB2312; otherwise Chinese.
        $is_zh = 1 unless $questions_before == $questions_cp932
                      and $questions_gb2312 > $questions_before;
    }
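As a rough check of this last variant, here is the same logic wrapped in a subroutine and run against the examples from earlier in the post. The name `guess_is_chinese` is hypothetical, and the encoded bytes are stored in a variable before counting purely for readability.

```perl
use strict;
use warnings;
use utf8;
use Encode;

my $enc_gb2312 = find_encoding('gb2312');
my $enc_cp932  = find_encoding('cp932');

# Hypothetical wrapper: 1 for "treat as Chinese", 0 for "treat as Japanese".
sub guess_is_chinese {
    my ($text) = @_;
    # Any hiragana or katakana settles it as Japanese.
    return 0 if $text =~ tr/\x{3041}-\x{3093}\x{30a1}-\x{30f6}//;

    my $before     = ($text =~ tr/?//);
    my $gb_bytes   = $enc_gb2312->encode($text);
    my $sjis_bytes = $enc_cp932->encode($text);
    my $gb_q       = ($gb_bytes   =~ tr/?//);
    my $sjis_q     = ($sjis_bytes =~ tr/?//);

    # Japanese only when cp932 covers everything and GB2312 does not.
    return ($sjis_q == $before && $gb_q > $before) ? 0 : 1;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $text ('関西電気保安協会', '真的？', '最低！') {
    printf "%s => %d\n", $text, guess_is_chinese($text);
}
```

「関西電気保安協会」 now comes out as Japanese because it contains forms GB2312 lacks, while 「最低！」 is still reported as Chinese since every character exists in both encodings. Note that kana-free text with no kanji at all (plain ASCII, for instance) also falls through to "Chinese" under this logic.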