In fact, we see an encoding built on this kind of code every single day.

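Compression effects of γ codes, δ codes, and Golomb codes - naoyaのはてなダイアリー
Ordinary integers are binary codes with a fixed length of 32 bits, that is, 4 bytes. Under a probability distribution where small numbers appear often and large numbers almost never appear, the wasted bits stand out.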

It is UTF-8.

UTF-8 converts integers from 0x0 to 0x10FFFF into byte sequences as follows.

Range/Offset      0         1         2         3
0x00-0x7F         0xxxxxxx
0x80-0x7FF        110xxxxx  10xxxxxx
0x800-0xFFFF      1110xxxx  10xxxxxx  10xxxxxx
0x10000-0x1FFFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
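To make the bit layout in the table concrete, here is a minimal sketch that builds the UTF-8 bytes of a code point by hand. The helper name `utf8_bytes` is hypothetical and exists only for illustration; in real code you would use `Encode::encode_utf8`. The sketch does no validation (surrogates and values above 0x1FFFFF are not rejected).

```perl
use strict;
use warnings;

# Hypothetical helper: turn one code point into a list of UTF-8 byte
# values, following the bit layout in the table above. No validation.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;
    return ( 0xC0 | ( $cp >> 6 ), 0x80 | ( $cp & 0x3F ) ) if $cp < 0x800;
    return (
        0xE0 | ( $cp >> 12 ),
        0x80 | ( ( $cp >> 6 ) & 0x3F ),
        0x80 | ( $cp & 0x3F )
    ) if $cp < 0x10000;
    return (
        0xF0 | ( $cp >> 18 ),
        0x80 | ( ( $cp >> 12 ) & 0x3F ),
        0x80 | ( ( $cp >> 6 ) & 0x3F ),
        0x80 | ( $cp & 0x3F )
    );
}

# One sample per row of the table: 1, 2, 3 and 4 bytes.
for my $cp ( 0x64, 0x3B4, 0x3042, 0x2A6B2 ) {
    printf "U+%04X: %s\n", $cp,
        join ' ', map { sprintf '%02x', $_ } utf8_bytes($cp);
}
```

For U+2A6B2 this yields the four bytes f0 aa 9a b2, the same bytes `encode_utf8` produces.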

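As you can see, the UTF-8 scheme can encode integers up to 0x1FFFFF, but actual Unicode only uses values up to 0x10FFFF, not 0x1FFFFF. One integer, that is, one character, needs up to 4 bytes. The nice part is that ASCII takes at most 1 byte, the next most common Latin-1 characters take 2 bytes, and almost all of the kana and kanji, which live in the BMP, fit in 3 bytes.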

With Variable Byte Code, by contrast, it looks like this.

Range/Offset      0         1         2
0x00-0x7F         0xxxxxxx
0x80-0x3FFF       0xxxxxxx  1xxxxxxx
0x4000-0x1FFFFF   0xxxxxxx  1xxxxxxx  1xxxxxxx

Every Unicode integer fits neatly within 3 bytes, and on top of that, most phonetic scripts, katakana and hiragana included, fit in 2 bytes.
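The size difference can be checked with a small sketch that only computes byte counts from the two range tables. The helper names `utf8_len` and `vbc_len` are hypothetical, and the sample code points are my own picks (ASCII "d", Greek delta, hiragana "a", a kanji, and U+2A6B2).

```perl
use strict;
use warnings;

# Hypothetical helpers: bytes needed per code point under each scheme,
# read straight off the range columns of the two tables.
sub utf8_len { my ($cp) = @_; return $cp < 0x80 ? 1 : $cp < 0x800 ? 2 : $cp < 0x10000 ? 3 : 4 }
sub vbc_len  { my ($cp) = @_; return $cp < 0x80 ? 1 : $cp < 0x4000 ? 2 : 3 }

for my $cp ( 0x64, 0x3B4, 0x3042, 0x6F22, 0x2A6B2 ) {
    printf "U+%05X  UTF-8: %d  VBC: %d\n", $cp, utf8_len($cp), vbc_len($cp);
}
```

Hiragana (U+3042) drops from 3 bytes to 2 and the supplementary-plane character from 4 to 3, while kanji at U+4E00 and above stay at 3 bytes under both schemes.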

It is not that hard to implement, either. I implemented it below; since everything fits in 24 bits, shall we call it UTF-24?

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

package Encode::UTF24;
use base qw/Encode::Encoding/;
__PACKAGE__->Define('UTF-24');
sub perlio_ok { 0 }

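# Decode UTF-24 octets into a Perl character string.
# A byte below 0x80 starts a character; bytes with the high bit set continue it.
sub decode {
    my ( $self, $bytes ) = @_;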
    my $utf8 = '';
    for ( my $i = 0 ; $i < length($bytes) ; $i++ ) {
        my $o0 = ord substr $bytes, $i, 1;
        my $o1 = ord substr $bytes, $i + 1, 1;
        if ($o1 < 0x80){
            $utf8 .= chr($o0);
        }else{
            my $o2 = ord substr $bytes, $i + 2, 1;
            if ( $o2 < 0x80 ) {
                $utf8 .= chr( ( $o0 << 7 ) + ( $o1 & 0x7F ) );
                $i += 1;
            }
            else{
                $utf8 .= chr(
                    ( $o0 << 14 ) + ( ( $o1 & 0x7F ) << 7 ) + ( $o2 & 0x7F )
                );
                $i += 2;
            }
        }
    }
    return $utf8;
}

# Encode a Perl character string into UTF-24 octets:
# the lead byte has its high bit clear, continuation bytes have it set.
sub encode {
    my ( $self, $utf8 ) = @_;
    my $bytes = '';
    for my $ord ( unpack 'U*', $utf8 ) {
        $bytes .=
            $ord < 0x80 ? chr($ord)
            : $ord < 0x4000 ? pack 'C2', $ord >> 7, 0x80 + ( $ord & 0x7F )
            : $ord <= 0x10FFFF ?    # inclusive, so U+10FFFF itself is accepted
            pack 'C3', $ord >> 14,
            0x80 + ( ( $ord >> 7 ) & 0x7f ),
            0x80 + ( $ord & 0x7F )
            : die "chr($ord) is impossible"; # never happens
    }
    return $bytes;
}

1;

package main;
use utf8;
local $\ = "\n";

sub hexdump{
    join " ", map { sprintf "%02x", $_ } @_;
}

binmode STDOUT, ":utf8";
# Sample strings; apart from "d" and 𪚲 (U+2A6B2) these are illustrative stand-ins.
for my $utf8 (qw/d δ あ 漢 𪚲 𪚲ているd/){
    my $utf24 = encode('UTF-24', $utf8);
    print "$utf8:", hexdump unpack 'C*', encode_utf8 $utf8;
    print "$utf8:", hexdump unpack 'C*', $utf24;
    print "$utf8:", hexdump unpack 'C*', encode_utf8 decode('UTF-24', $utf24);
    print "";
}

So why, in spite of all this, did UTF-8 not adopt Variable Byte Code?

My guess at the reason is that it would have tripped up software that is not Unicode-aware far too easily. For example, look at the code for 𪚲 (U+2A6B2): it becomes the byte sequence \x0a,\xcd,\xb2. But \x0a is the ASCII LF, so "process one line at a time" no longer works.
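The line-splitting problem is easy to reproduce. The following is a minimal sketch, assuming a standalone helper `vbc_encode` (a hypothetical name, reimplementing the three-row table for a single code point) so that it runs without the Encode::UTF24 package.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: Variable Byte Code for a single code point.
sub vbc_encode {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;
    return pack( 'C2', $cp >> 7, 0x80 | ( $cp & 0x7F ) ) if $cp < 0x4000;
    return pack( 'C3', $cp >> 14, 0x80 | ( ( $cp >> 7 ) & 0x7F ), 0x80 | ( $cp & 0x7F ) );
}

my $vbc  = vbc_encode(0x2A6B2);          # octets 0a cd b2
my $utf8 = encode_utf8( chr 0x2A6B2 );   # octets f0 aa 9a b2

# A byte-oriented tool that splits on LF sees two "lines" in the VBC form
# of this single character, but only one in the UTF-8 form.
my @vbc_lines  = split /\n/, $vbc,  -1;
my @utf8_lines = split /\n/, $utf8, -1;

printf "VBC:   %s -> %d piece(s)\n", join( ' ', map { sprintf '%02x', $_ } unpack 'C*', $vbc ),  scalar @vbc_lines;
printf "UTF-8: %s -> %d piece(s)\n", join( ' ', map { sprintf '%02x', $_ } unpack 'C*', $utf8 ), scalar @utf8_lines;
```

In UTF-8 every byte of a multibyte character has its high bit set, so an ASCII byte such as LF can never appear inside one.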

Put another way, Variable Byte Code had no redundancy. UTF-8 has "gaps", that is, bytes that can never appear in it. \xFF is one of them. I once took advantage of this to do the following.

404 Blog Not Found: The Possibility of EUC-UTF8
With that in place, a definition like this becomes possible.
EUC-UTF8-CHAR = EUC-CHAR | \xFF + UTF-8-CHAR

This kind of thing is impossible with Variable Byte Code, which has "no wasted bits".

Seen this way, UTF-8 is really well designed. You can tell which byte is the start of a character, and by looking at that lead byte you can tell at once how many more bytes the character needs.
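Both properties follow from the high bits of each byte alone. Here is a minimal sketch, with hypothetical helper names `utf8_seq_len` and `char_starts`. It is a structural check only: it does not reject bytes such as 0xC0, 0xC1, or 0xF5 to 0xF7, which have a lead-byte shape but are not valid in strict UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: classify one byte value by its high bits.
# Returns the total sequence length for a lead byte, 0 for a
# continuation byte, and undef for a byte shape UTF-8 never uses.
sub utf8_seq_len {
    my ($byte) = @_;
    return 1 if $byte < 0x80;    # 0xxxxxxx
    return 0 if $byte < 0xC0;    # 10xxxxxx (continuation)
    return 2 if $byte < 0xE0;    # 110xxxxx
    return 3 if $byte < 0xF0;    # 1110xxxx
    return 4 if $byte < 0xF8;    # 11110xxx
    return;
}

# Offsets at which a character starts: every byte with a nonzero length.
sub char_starts {
    my ($octets) = @_;
    my @bytes = unpack 'C*', $octets;
    return grep { utf8_seq_len( $bytes[$_] ) } 0 .. $#bytes;
}

# "d", U+3042 and U+2A6B2 take 1, 3 and 4 bytes respectively.
my $octets = encode_utf8( 'd' . chr(0x3042) . chr(0x2A6B2) );
print join( ',', char_starts($octets) ), "\n";    # 0,1,4
```

Because lead and continuation bytes never overlap, a reader dropped into the middle of a stream can skip forward to the next byte with a nonzero length and resume decoding from there.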

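When you design an encoding, how much you compress certainly matters, but how much redundancy you leave in matters quite a lot as well. Pack everything to bursting from the start and you will regret it later. That is becoming a problem for Unicode itself, too.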

Dan the Man with Too Many Encodings to Support