In fact, we see an encoding of this kind every day.
γ符号、δ符号、ゴロム符号による圧縮効果 - naoyaのはてなダイアリー
"Integers are normally stored as a fixed-length binary code of 32 bits, that is, 4 bytes, but under a probability distribution where small numbers occur often and large numbers almost never occur, the wasted bits stand out."
It is UTF-8.
UTF-8 converts integers from 0x0 to 0x10FFFF into byte sequences as follows.
Range/Offset | 0 | 1 | 2 | 3 |
0x00-0x7F | 0xxxxxxx | | | |
0x80-0x7FF | 110xxxxx | 10xxxxxx | | |
0x800-0xFFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
0x10000-0x1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
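As you can see, UTF-8's scheme can encode integers up to 0x1FFFFF, although Unicode itself uses only the range up to 0x10FFFF, not 0x1FFFFF. One integer, that is, one character, needs up to 4 bytes. Still, ASCII, the most heavily used range, takes at most 1 byte; Latin-1 characters, the next most common, take 2 bytes; and most of the kana and kanji in the BMP get by with 3 bytes.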
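The byte counts above can be checked with a few lines of Perl. This is a minimal sketch: `utf8_length` is a hypothetical helper that applies the standard UTF-8 range boundaries, and the sample code points are arbitrary choices. It cross-checks the prediction against `encode_utf8` from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Bytes needed per code point under the standard UTF-8 layout.
# (Hypothetical helper for illustration.)
sub utf8_length {
    my ($cp) = @_;
    return $cp < 0x80    ? 1
         : $cp < 0x800   ? 2
         : $cp < 0x10000 ? 3
         :                 4;
}

# Cross-check against the encoder in Perl's core Encode module.
# Samples: ASCII, Latin-1, hiragana, kanji, and a character outside the BMP.
for my $cp ( 0x64, 0xE9, 0x3042, 0x5F3E, 0x2A6B2 ) {
    my $actual = length( encode_utf8( chr($cp) ) );
    printf "U+%04X: %d byte(s) predicted, %d from encode_utf8\n",
        $cp, utf8_length($cp), $actual;
}
```

For each sample the predicted length should agree with what the encoder actually produces.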
With Variable Byte Code, by contrast, it looks like this.
Range/Offset | 0 | 1 | 2 |
0x00-0x7F | 0xxxxxxx | | |
0x80-0x3FFF | 0xxxxxxx | 1xxxxxxx | |
0x4000-0x1FFFFF | 0xxxxxxx | 1xxxxxxx | 1xxxxxxx |
All of Unicode fits neatly within 3 bytes, and on top of that, most phonetic characters, including katakana and hiragana, fit in 2 bytes.
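To put a number on the saving, here is a small sketch that counts bytes for a short hiragana string under both layouts. The helpers `utf8_length`, `vbc_length` and `encoded_sizes` are hypothetical names introduced only for this illustration, and the sample string is an arbitrary choice.

```perl
use strict;
use warnings;

# Bytes per code point under the standard UTF-8 layout.
sub utf8_length {
    my ($cp) = @_;
    return $cp < 0x80 ? 1 : $cp < 0x800 ? 2 : $cp < 0x10000 ? 3 : 4;
}

# Bytes per code point under the 7-bits-per-byte Variable Byte Code layout.
sub vbc_length {
    my ($cp) = @_;
    return $cp < 0x80 ? 1 : $cp < 0x4000 ? 2 : 3;
}

# Returns (UTF-8 bytes, Variable Byte Code bytes) for a character string.
sub encoded_sizes {
    my ($text) = @_;
    my ( $utf8, $vbc ) = ( 0, 0 );
    for my $cp ( unpack 'U*', $text ) {
        $utf8 += utf8_length($cp);
        $vbc  += vbc_length($cp);
    }
    return ( $utf8, $vbc );
}

# Five hiragana: U+3053 U+3093 U+306B U+3061 U+306F
my $text = "\x{3053}\x{3093}\x{306B}\x{3061}\x{306F}";
my ( $utf8, $vbc ) = encoded_sizes($text);
print "UTF-8: $utf8 bytes, Variable Byte Code: $vbc bytes\n";    # 15 and 10
```

Every hiragana code point lies between 0x800 and 0x3FFF, so each one costs 3 bytes in UTF-8 but only 2 in Variable Byte Code.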
It is not that hard to implement, either. Below is an example implementation. Since everything fits in 24 bits, let's call it UTF-24.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode;

    package Encode::UTF24;
    use base qw/Encode::Encoding/;
    __PACKAGE__->Define('UTF-24');

    sub perlio_ok { 0 }

    # Bytes -> characters. The first byte of a character has its high bit
    # clear; every following byte of the same character has it set.
    sub decode {
        my ( $self, $bytes ) = @_;
        my $utf8 = '';
        for ( my $i = 0 ; $i < length($bytes) ; $i++ ) {
            my $o0 = ord( substr( $bytes, $i,     1 ) );
            my $o1 = ord( substr( $bytes, $i + 1, 1 ) );
            if ( $o1 < 0x80 ) {    # next byte starts a new character
                $utf8 .= chr($o0);
            }
            else {
                my $o2 = ord( substr( $bytes, $i + 2, 1 ) );
                if ( $o2 < 0x80 ) {    # 2-byte code
                    $utf8 .= chr( ( $o0 << 7 ) + ( $o1 & 0x7F ) );
                    $i += 1;
                }
                else {                 # 3-byte code
                    $utf8 .= chr( ( $o0 << 14 )
                          + ( ( $o1 & 0x7F ) << 7 )
                          + ( $o2 & 0x7F ) );
                    $i += 2;
                }
            }
        }
        return $utf8;
    }

    # Characters -> bytes, 7 payload bits per byte.
    sub encode {
        my ( $self, $utf8 ) = @_;
        my $bytes = '';
        for my $ord ( unpack 'U*', $utf8 ) {
            $bytes .=
                $ord < 0x80 ? chr($ord)
              : $ord < 0x4000
              ? pack( 'C2', $ord >> 7, 0x80 + ( $ord & 0x7F ) )
              : $ord <= 0x10FFFF    # inclusive, so U+10FFFF itself is accepted
              ? pack( 'C3',
                $ord >> 14,
                0x80 + ( ( $ord >> 7 ) & 0x7F ),
                0x80 + ( $ord & 0x7F ) )
              : die "chr($ord) is impossible";    # never happens
        }
        return $bytes;
    }
    1;

    package main;
    use utf8;
    local $\ = "\n";

    sub hexdump {
        join " ", map { sprintf "%02x", $_ } @_;
    }

    binmode STDOUT, ":utf8";

    # Sample strings. Apart from "d" and "𪚲", the original samples are
    # illegible; the others here are stand-ins.
    for my $utf8 (qw/d é あ 弾 𪚲 𪚲弾あéd/) {
        my $utf24 = encode( 'UTF-24', $utf8 );
        print "$utf8:", hexdump( unpack 'C*', encode_utf8($utf8) );
        print "$utf8:", hexdump( unpack 'C*', $utf24 );
        print "$utf8:",
            hexdump( unpack 'C*', encode_utf8( decode( 'UTF-24', $utf24 ) ) );
        print "";
    }
So why, despite all this, did UTF-8 not adopt Variable Byte Code?
My guess is that it would trip up software that is not Unicode-aware far too easily. For example, look at how 𪚲 (U+2A6B2) comes out: it becomes the byte sequence \x0a,\xcd,\xb2, but \x0a is the ASCII LF. With that, you can no longer "process the input one line at a time".
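The LF problem is easy to reproduce. The sketch below uses a hypothetical stand-alone helper, `vbc_encode`, that follows the same 7-bits-per-byte layout, and then splits the result on newlines the way a byte-oriented tool would.

```perl
use strict;
use warnings;

# Encode one code point with the 7-bits-per-byte layout.
# (Hypothetical helper for illustration.)
sub vbc_encode {
    my ($cp) = @_;
    return $cp < 0x80   ? pack( 'C', $cp )
         : $cp < 0x4000 ? pack( 'C2', $cp >> 7, 0x80 | ( $cp & 0x7F ) )
         :                pack( 'C3',
                              $cp >> 14,
                              0x80 | ( ( $cp >> 7 ) & 0x7F ),
                              0x80 | ( $cp & 0x7F ) );
}

my $bytes = vbc_encode(0x2A6B2);
print join( ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes ), "\n";    # 0a cd b2

# A tool that knows nothing about the encoding and splits on LF
# cuts the record in the middle of the character.
my @lines = split /\n/, "A" . $bytes . "B", -1;
print scalar(@lines), " line(s)\n";    # 2, although no line break was intended
```

The same thing happens to any code point whose leading 7-bit group equals an ASCII control character, so a byte-oriented tool cannot treat the data as plain text.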
Put another way, Variable Byte Code has no redundancy. UTF-8 has "gaps", that is, bytes that can never appear in it. \xFF is one of them, and I once took advantage of this to do the following.
404 Blog Not Found:EUC-UTF8の可能性
"In short, it comes down to this:"
EUC-UTF8-CHAR = EUC-CHAR | \xFF + UTF-8-CHAR
This kind of trick is impossible with Variable Byte Code, which has "no wasted bits".
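One way to read the EUC-UTF8-CHAR rule is as a tokenizer that treats \xFF as an escape. The sketch below is only an illustration under stated assumptions: the byte-level patterns for EUC-JP and UTF-8 are simplified (the UTF-8 pattern does not reject overlong forms or surrogates), and `tokenize_euc_utf8` is a hypothetical name, not code from the quoted post.

```perl
use strict;
use warnings;

# Simplified byte-level patterns (assumptions for illustration).
my $EUC_CHAR = qr/[\x00-\x7F]|\x8E[\xA1-\xDF]|\x8F[\xA1-\xFE]{2}|[\xA1-\xFE]{2}/;
my $UTF8_CHAR =
  qr/[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/;

# Split a byte string into [kind, bytes] tokens; dies on malformed input.
# \xFF appears in neither encoding, so it can safely announce a UTF-8 character.
sub tokenize_euc_utf8 {
    my ($bytes) = @_;
    my @tokens;
    while ( $bytes =~ /\G(?:\xFF($UTF8_CHAR)|($EUC_CHAR))/gc ) {
        push @tokens, defined $1 ? [ 'utf8', $1 ] : [ 'euc', $2 ];
    }
    my $pos = pos($bytes) // 0;
    die "malformed input at byte $pos\n" if $pos < length($bytes);
    return @tokens;
}

# EUC-JP hiragana "a" (A4 A2), escaped UTF-8 for U+2A6B2 (F0 AA 9A B2), ASCII "d".
my $mixed = "\xA4\xA2" . "\xFF\xF0\xAA\x9A\xB2" . "d";
print join( ' ', map { $_->[0] } tokenize_euc_utf8($mixed) ), "\n";    # euc utf8 euc
```

The escape works only because \xFF is a gap in both encodings; an encoding that uses every byte value leaves no spare byte to play that role.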
Seen this way, UTF-8 is really well designed. You can tell which byte is the start of a character, and by looking at that lead byte you can tell, without reading ahead, how many more bytes the character needs.
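Both properties can be sketched in a few lines. `trailing_bytes` and `char_start` are hypothetical helper names; the first follows the lead-byte bit patterns only and does not validate the bytes any further.

```perl
use strict;
use warnings;

# Number of continuation bytes announced by a UTF-8 lead byte.
# Returns undef (in scalar context) for a continuation byte or a byte
# that cannot lead. Based on bit patterns only; no further validation.
sub trailing_bytes {
    my ($byte) = @_;
    return 0 if $byte < 0x80;    # 0xxxxxxx
    return   if $byte < 0xC0;    # 10xxxxxx: continuation byte
    return 1 if $byte < 0xE0;    # 110xxxxx
    return 2 if $byte < 0xF0;    # 1110xxxx
    return 3 if $byte < 0xF8;    # 11110xxx
    return;                      # 0xF8-0xFF never appear
}

# From any byte offset, step back over continuation bytes to the lead byte.
sub char_start {
    my ( $bytes, $pos ) = @_;
    $pos-- while $pos > 0 && ( ord( substr( $bytes, $pos, 1 ) ) & 0xC0 ) == 0x80;
    return $pos;
}

# UTF-8 bytes for "d", U+3042 and U+2A6B2.
my $bytes = "d\xE3\x81\x82\xF0\xAA\x9A\xB2";
my $start = char_start( $bytes, 6 );
my $more  = trailing_bytes( ord( substr( $bytes, $start, 1 ) ) );
print "offset 6 is inside the character starting at offset $start; ",
      "its lead byte announces $more more byte(s)\n";    # start 4, 3 more
```

With Variable Byte Code the first property still holds, since the first byte is the only one with its high bit clear, but the length is known only after reading ahead to the next first byte.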
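When you design an encoding, "how much can it compress" matters, of course, but "how much redundancy should it leave" turns out to matter quite a lot as well. Pack everything in tight from the start and you will regret it later. Unicode itself is feeling the pinch of exactly that.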
Dan the Man with Too Many Encodings to Support