Introduction
Hello, this is nikkie.
The other day I looked at how ChatGPT tokenizes Japanese text.
To go from token IDs back to the corresponding text I had to work with Python's bytes, and this post writes up a question that came up along the way.
Table of contents
- Introduction
- Table of contents
- Encoding あ (U+3042) gives b'\xe3\x81\x82'
- Converting the following characters reveals a pattern
- Digression: you see these in other places too
- The conversion rule: 1110 xxxx 10xx xxxx 10xx xxxx
- Hiragana, katakana, and kanji all get the 3-byte 1110 xxxx 10xx xxxx 10xx xxxx conversion
- Closing
- P.S. References on Unicode
Encoding あ (U+3042) gives b'\xe3\x81\x82'
The hiragana あ has the Unicode code point U+3042.
Note: the Python version used in this article is 3.10.9.
>>> hex(ord("あ"))
'0x3042'
- Code points are written as U+<hexadecimal number>.
- The built-in function ord returns "an integer representing the Unicode code point" of a single character (in decimal).
- The built-in function hex converts that integer to hexadecimal.
Converting the hiragana あ (a str) to bytes gives:
>>> "あ".encode()
b'\xe3\x81\x82'
https://docs.python.org/ja/3/library/stdtypes.html#str.encode
- The first argument (encoding) defaults to UTF-8.
>>> "あ".encode("utf8")
b'\xe3\x81\x82'
Converting the following characters reveals a pattern
I then looked at the code points and bytes of the hiragana that follow あ.
>>> hex(ord("い"))
'0x3044'
>>> "い".encode()
b'\xe3\x81\x84'
>>> hex(ord("う"))
'0x3046'
>>> "う".encode()
b'\xe3\x81\x86'
>>> hex(ord("え"))
'0x3048'
>>> "え".encode()
b'\xe3\x81\x88'
>>> hex(ord("お"))
'0x304a'
>>> "お".encode()
b'\xe3\x81\x8a'
With each step through あ, い, う, え, お, both the code point and the last byte increase by 2!
The last hex digit is the same in the code point and in the bytes (2 -> 4 -> 6 -> 8 -> a).
Seeing this, I got curious: "Is there some rule for converting a code point to bytes?"
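One detail behind the step of 2: in the Unicode table each of these vowels is preceded by its small form (ぁ, ぃ, ...), so the code points in between are already taken. A minimal sketch that lists the whole run from U+3041 to U+304A; the name `rows` is only for illustration.

```python
# List every code point from U+3041 (small ぁ) to U+304A (お)
# together with the character and its UTF-8 bytes.
rows = []
for code_point in range(0x3041, 0x304A + 1):
    char = chr(code_point)
    rows.append((f"U+{code_point:04X}", char, char.encode("utf-8")))

for label, char, encoded in rows:
    print(label, char, encoded)
```

In this run the code point and the last byte both go up by 1 per row, which hints that the low bits of the code point are copied straight into the last byte.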
Digression: you see these in other places too
Unicode code points, and the values you get by converting them to bytes, also show up inside the Python standard library.
Unicode code points in json
>>> import json
>>> json.dumps({"key": "あい"})
'{"key": "\\u3042\\u3044"}'
>>> json.dumps({"key": "あい"}, ensure_ascii=False)
'{"key": "あい"}'
The ensure_ascii argument defaults to True.
https://docs.python.org/ja/3/library/json.html#json.dump
"If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is."
So "escaped" here means the character is written out as its Unicode code point (\uXXXX).
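To check that the escape really is just the code point, here is a small sketch that builds the \uXXXX form by hand and compares it with json.dumps. The variable names are illustrative, and the shortcut is only assumed to hold for characters up to U+FFFF.

```python
import json

text = "あい"

# Build the \uXXXX form by hand from each character's code point.
# Assumption: every character is in the BMP (U+0000 to U+FFFF).
manual = "".join(f"\\u{ord(char):04x}" for char in text)

# ensure_ascii is True by default, so non-ASCII characters are escaped.
dumped = json.dumps(text)

print(manual)
print(dumped)  # same escapes, wrapped in double quotes
```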
Byte sequences in urllib.parse
When a URL contains Japanese, the text is converted to a byte sequence.
This is apparently called "URL encoding" or "percent-encoding" (RFC 3986).
- What the browser's URL bar shows: https://example.com/page/あい
- What it actually is: https://example.com/page/%E3%81%82%E3%81%84
For URL encoding there is a function that does exactly this, urllib.parse.urlencode!
https://docs.python.org/ja/3/library/urllib.parse.html#urllib.parse.urlencode
>>> import urllib.parse
>>> urllib.parse.urlencode({"key": "あい"})
'key=%E3%81%82%E3%81%84'
I had assumed URL encoding followed some rule of its own, but it is the same rule as encoding a str to bytes: each %XX is one byte of the UTF-8 encoding!
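As a rough check of that observation, the sketch below percent-encodes the UTF-8 bytes by hand and compares the result with urllib.parse.quote, the standard-library function for quoting a single path or query component. The hand-rolled version escapes every byte, so it should only be expected to match for non-ASCII text like this example.

```python
import urllib.parse

text = "あい"

# Percent-encode by hand: one %XX per UTF-8 byte.
manual = "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))

# quote() encodes non-ASCII characters as UTF-8 by default.
quoted = urllib.parse.quote(text)

print(manual)
print(quoted)
```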
The conversion rule: 1110 xxxx 10xx xxxx 10xx xxxx
After some digging, I arrived at the following.
Quoting from "UTF-8の符号化方法" (how UTF-8 encodes):
"UTF-8 converts a code point into a variable-length sequence of 1 to 4 bytes."
Code points in the range U+0800 to U+FFFF are converted to 3 bytes as follows.
1110 xxxx 10xx xxxx 10xx xxxx
Applying the rule: あ
- あ (U+3042) is in the range U+0800 to U+FFFF, so it is converted to 3 bytes.
- The bits of U+3042 are 0011 0000 0100 0010.
- Byte 1: 1110 0011 (E3), filled with the first 4 bits of U+3042.
- Byte 2: 1000 0001 (81), filled with bits 5 through 10 of U+3042.
- Byte 3: 1000 0010 (82), filled with bit 11 through the last bit of U+3042.
- The 3 bytes are E38182: exactly the bytes we saw from "あ".encode()!
Another example: お
- お (U+304A) is in the range U+0800 to U+FFFF -> converted to 3 bytes.
- The bits of U+304A are 0011 0000 0100 1010.
- Byte 1: 1110 0011 (E3)
- Byte 2: 1000 0001 (81)
- Byte 3: 1000 1010 (8A)
- -> E3818A
One more example: 誕
>>> hex(ord("誕"))
'0x8a95'
>>> "誕".encode()
b'\xe8\xaa\x95'
- 誕 (U+8A95) is in the range U+0800 to U+FFFF -> converted to 3 bytes.
- The bits of U+8A95 are 1000 1010 1001 0101.
- Byte 1: 1110 1000 (E8)
- Byte 2: 1010 1010 (AA)
- Byte 3: 1001 0101 (95)
- -> E8AA95
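The three worked examples can be reproduced with a few bit operations. This is a minimal sketch of the 3-byte case only, not how Python implements encode(); the function name `encode_3_bytes` is made up for illustration, and surrogate code points are not checked.

```python
def encode_3_bytes(char: str) -> bytes:
    """Apply 1110xxxx 10xxxxxx 10xxxxxx to one character in U+0800..U+FFFF."""
    code_point = ord(char)
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers U+0800 to U+FFFF")
    # Byte 1: marker 1110 followed by the top 4 bits of the code point.
    first = 0b1110_0000 | (code_point >> 12)
    # Byte 2: marker 10 followed by the middle 6 bits.
    second = 0b1000_0000 | ((code_point >> 6) & 0b11_1111)
    # Byte 3: marker 10 followed by the low 6 bits.
    third = 0b1000_0000 | (code_point & 0b11_1111)
    return bytes([first, second, third])


for char in "あお誕":
    by_hand = encode_3_bytes(char)
    built_in = char.encode("utf-8")
    print(char, by_hand.hex().upper(), built_in.hex().upper())
```

Each line prints the hand-computed bytes next to the built-in result so the two can be compared directly.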
Hiragana, katakana, and kanji all get the 3-byte 1110 xxxx 10xx xxxx 10xx xxxx conversion
My understanding is that the range U+0800 to U+FFFF includes hiragana, katakana, and kanji.
- It falls within Unicode's Basic Multilingual Plane (BMP).
- Characters it contains:
  - Hiragana: U+3040 to U+309F
  - Katakana: U+30A0 to U+30FF
  - Kanji
    - CJK Unified Ideographs: U+4E00 to U+9FFF
    - Others: U+3400 to U+4DBF, U+F900 to U+FAFF
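A quick way to sanity-check this is to encode every code point in the listed ranges and collect the byte lengths. This is a rough sketch: the labels for the two "other" ranges are the Unicode block names as I understand them, some code points in these ranges are unassigned (they still encode), and kanji outside these ranges, such as those in the supplementary planes, are not covered.

```python
# (label, first code point, last code point), taken from the ranges listed above.
blocks = [
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("CJK Unified Ideographs", 0x4E00, 0x9FFF),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DBF),
    ("CJK Compatibility Ideographs", 0xF900, 0xFAFF),
]

# For each range, collect the set of UTF-8 byte lengths that occur.
lengths = {
    label: {len(chr(cp).encode("utf-8")) for cp in range(first, last + 1)}
    for label, first, last in blocks
}

for label, seen in lengths.items():
    print(label, seen)
```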
Closing
I was curious about the (UTF-8) conversion rule that turns chr(0x3042).encode() into b'\xe3\x81\x82', looked into it, and now fully understand it!
As a Unicode code point the value is 4 hex digits (2 bytes), but the conversion rule adds 8 marker bits (= 4 + 2 + 2), so the bytes come out as 3 bytes.
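Read the other way round, stripping those 8 marker bits from the 3 bytes gives back the 16 bits of the code point. A minimal sketch under that assumption follows; `decode_3_bytes` is a hypothetical helper that handles only well-formed 3-byte sequences and does not reject overlong encodings.

```python
def decode_3_bytes(encoded: bytes) -> int:
    """Recover the code point from one 3-byte UTF-8 sequence."""
    first, second, third = encoded  # raises ValueError unless exactly 3 bytes
    # Check the marker bits: 1110 on byte 1, 10 on bytes 2 and 3.
    if first >> 4 != 0b1110 or second >> 6 != 0b10 or third >> 6 != 0b10:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Drop the 4 + 2 + 2 marker bits and join the remaining 4 + 6 + 6 bits.
    return ((first & 0b1111) << 12) | ((second & 0b11_1111) << 6) | (third & 0b11_1111)


code_point = decode_3_bytes("あ".encode("utf-8"))
print(f"U+{code_point:04X}")
```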
I got interested in this through ChatGPT's tokens, but I noticed that UTF-8 byte sequences also turn up in other places, such as URL encoding.
Character encodings look like a deep world (one you might never come back from), but this gave me a real sense that they are quietly supporting us behind the scenes.
While looking at ChatGPT's tokenization
>>> "あ".encode()
b'\xe3\x81\x82'
I learned how to handle bytes.
The Unicode code point of あ is U+3042, and I wondered whether there was a rule for converting it to a byte sequence. After tracing it through, this is what I found: https://t.co/t3Y9GXS4eP
So it is converted as 1110 xxxx 10xx xxxx 10xx xxxx. Question answered.
- nikkie にっきー (@ftnext) April 23, 2023
P.S. References on Unicode
I read only the parts I was curious about and learned about URL encoding and the BMP.
I think it is easy to follow.