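# Unicode strings

Original: https://tensorflow.google.cn/tutorials/load_data/unicode

## Introduction

Models that process natural language often handle different languages with different character sets. Unicode is a standard encoding system used to represent characters from almost all languages. Each character is encoded using a unique integer code point between `0` and `0x10FFFF`. A Unicode string is a sequence of zero or more code points.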
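To make the notion of a code point concrete, here is a minimal plain-Python sketch that needs no TensorFlow. It relies only on the built-ins `ord` and `chr`; the variable names are illustrative and not part of the tutorial.

```python
# Every Unicode character corresponds to one integer code point
# in the range [0, 0x10FFFF].
MAX_CODE_POINT = 0x10FFFF

text = "Thanks 😊"

# str -> list of integer code points
code_points = [ord(ch) for ch in text]

# list of integer code points -> str
rebuilt = "".join(chr(cp) for cp in code_points)

print(code_points)
print(all(0 <= cp <= MAX_CODE_POINT for cp in code_points))
print(rebuilt == text)
```

The emoji is a single code point (128522, or `0x1F60A`) even though it needs several bytes once encoded.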

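This tutorial shows how to represent Unicode strings in TensorFlow and how to manipulate them using Unicode equivalents of standard string operations. It also separates Unicode strings into tokens based on script detection.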

```python
import tensorflow as tf
```

## The `tf.string` data type

The basic TensorFlow `tf.string` dtype lets you build tensors of byte strings. Unicode strings are UTF-8 encoded by default.

```python
tf.constant(u"Thanks 😊")
```

```
<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>
```

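A [`tf.string`](https://tensorflow.google.cn/api_docs/python/tf#string) tensor can hold byte strings of varying lengths because the byte strings are treated as atomic units. The string length is not included in the tensor dimensions.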

```python
tf.constant([u"You're", u"welcome!"]).shape
```

```
TensorShape([2])
```

Note: When constructing strings in Python, Unicode is handled differently in v2 and v3. In v2, Unicode strings are marked with the "u" prefix, as above. In v3, strings are Unicode-encoded by default.

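## Representing Unicode

There are two standard ways to represent a Unicode string in TensorFlow:

*   `string` scalar: the sequence of code points is encoded using a known [character encoding](https://en.wikipedia.org/wiki/Character_encoding).
*   `int32` vector: each position contains a single code point.

For example, the following three values all represent the Unicode string `"语言处理"` (which means "language processing" in Chinese):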

```python
# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8
```

```
<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>
```

```python
# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be
```

```
<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
```

```python
# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars
```

```
<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>
```
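The three representations can be compared side by side without TensorFlow. The sketch below is plain Python with illustrative variable names; it shows that the same four characters occupy a different number of bytes depending on the encoding, while the code-point vector always has one entry per character.

```python
text = "语言处理"

utf8_bytes = text.encode("utf-8")         # 3 bytes per character for this text
utf16be_bytes = text.encode("utf-16-be")  # 2 bytes per character for this text
code_points = [ord(ch) for ch in text]    # one integer per character

print(len(utf8_bytes), len(utf16be_bytes), len(code_points))
print(code_points)
```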

### Converting between representations

TensorFlow provides operations to convert between these representations:

*   [`tf.strings.unicode_decode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_decode): converts an encoded string scalar to a vector of code points.
*   [`tf.strings.unicode_encode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_encode): converts a vector of code points to an encoded string scalar.
*   [`tf.strings.unicode_transcode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_transcode): converts an encoded string scalar to a different encoding.

```python
tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')
```

```
<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>
```

```python
tf.strings.unicode_encode(text_chars, output_encoding='UTF-8')
```

```
<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>
```

```python
tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF8',
                             output_encoding='UTF-16-BE')
```

```
<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
```
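As a rough mental model, the three conversions correspond to the following plain-Python helpers. The names `py_decode`, `py_encode` and `py_transcode` are hypothetical and only mirror what the TensorFlow ops compute for a single valid string; they do not cover batching or error handling.

```python
def py_decode(data, input_encoding):
    """Encoded bytes -> list of integer code points."""
    return [ord(ch) for ch in data.decode(input_encoding)]


def py_encode(code_points, output_encoding):
    """List of integer code points -> encoded bytes."""
    return "".join(chr(cp) for cp in code_points).encode(output_encoding)


def py_transcode(data, input_encoding, output_encoding):
    """Encoded bytes -> bytes in another encoding."""
    return data.decode(input_encoding).encode(output_encoding)


utf8 = "语言处理".encode("utf-8")
chars = py_decode(utf8, "utf-8")
print(chars)
print(py_encode(chars, "utf-8") == utf8)
print(py_transcode(utf8, "utf-8", "utf-16-be"))
```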

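### Batch dimensions

When decoding multiple strings, the number of characters in each string may not be equal. The result is a [`tf.RaggedTensor`](https://tensorflow.google.cn/guide/ragged_tensor), where the length of the innermost dimension varies with the number of characters in each string: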

```python
# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)
```

```
[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]
```

You can use this [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) directly, or convert it to a dense [`tf.Tensor`](https://tensorflow.google.cn/api_docs/python/tf/Tensor) with padding or to a [`tf.SparseTensor`](https://tensorflow.google.cn/api_docs/python/tf/sparse/SparseTensor) using the methods [`tf.RaggedTensor.to_tensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor#to_tensor) and [`tf.RaggedTensor.to_sparse`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor#to_sparse).

```python
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())
```

```
[[ 104 195 108 108 111 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
 [ 87 104 97 116 32 105 115 32 116 104 101 32 119 101 97 116 104 101 114 32 116 111 109 111 114 114 111 119]
 [ 71 246 246 100 110 105 103 104 116 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
 [128522 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]]
```
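The padding step can be pictured with ordinary Python lists. This is only a sketch of the idea: `pad_rows` is a hypothetical helper, not a TensorFlow function, and it assumes each row is a list of code points.

```python
def pad_rows(rows, default_value=-1):
    """Right-pad every row with default_value up to the longest row's length."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [default_value] * (width - len(row)) for row in rows]


ragged = [
    [104, 195, 108, 108, 111],
    [71, 246, 246, 100, 110, 105, 103, 104, 116],
    [128522],
]
padded = pad_rows(ragged)
for row in padded:
    print(row)
```

A value such as `-1` works as padding here because it can never be a real code point.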

```python
batch_chars_sparse = batch_chars_ragged.to_sparse()
```

When encoding multiple strings that all have the same length, a [`tf.Tensor`](https://tensorflow.google.cn/api_docs/python/tf/Tensor) can be used as input:

```python
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]],
                          output_encoding='UTF-8')
```

```
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>
```

When encoding multiple strings of varying length, a [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) should be used as input:

```python
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')
```

```
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
```

If you have a tensor holding multiple strings in padded or sparse format, convert it to a [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) before calling `unicode_encode`:

```python
tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8')
```

```
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
```

```python
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8')
```

```
<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
```
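One way to see why the conversion comes first: the padding value `-1` is not a valid code point, so it has to be removed before any encoding can happen. The plain-Python sketch below uses a hypothetical helper, `strip_padding`, and made-up sample rows; it only imitates trimming trailing padding and is not TensorFlow's implementation.

```python
def strip_padding(row, padding=-1):
    """Drop the trailing run of padding values from a row of code points."""
    end = len(row)
    while end > 0 and row[end - 1] == padding:
        end -= 1
    return row[:end]


padded = [
    [99, 97, 116, -1, -1],       # "cat" plus two padding slots
    [104, 111, 114, 115, 101],   # "horse", no padding needed
]
words = [
    "".join(chr(cp) for cp in strip_padding(row)).encode("utf-8")
    for row in padded
]
print(words)
```

Calling `chr(-1)` directly would raise a `ValueError`, which is the plain-Python analogue of trying to encode a padded row as-is.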

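## Unicode operations

### Character length

The [`tf.strings.length`](https://tensorflow.google.cn/api_docs/python/tf/strings/length) operation has a `unit` parameter that indicates how lengths should be computed. `unit` defaults to `"BYTE"`, but it can be set to other values, such as `"UTF8_CHAR"` or `"UTF16_CHAR"`, to determine the number of Unicode code points in each encoded `string`.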

```python
# Note that the final character takes up 4 bytes in UTF8.
thanks = u'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))
```

```
11 bytes; 8 UTF-8 characters
```
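The gap between byte length and character length grows with the number of non-ASCII characters. The following plain-Python sketch (illustrative names, no TensorFlow) computes both lengths for the batch of strings used earlier in this tutorial; the accented letters are written as escapes so that the literal code points are unambiguous.

```python
batch = [
    "h\u00c3llo",                    # 'hÃllo'
    "What is the weather tomorrow",
    "G\u00f6\u00f6dnight",           # 'Göödnight'
    "😊",
]

byte_lengths = [len(s.encode("utf-8")) for s in batch]  # length in bytes
char_lengths = [len(s) for s in batch]                  # length in code points

for s, nbytes, nchars in zip(batch, byte_lengths, char_lengths):
    print("{!r}: {} bytes, {} characters".format(s, nbytes, nchars))
```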

### Character substrings

Similarly, the [`tf.strings.substr`](https://tensorflow.google.cn/api_docs/python/tf/strings/substr) operation accepts the `unit` parameter and uses it to determine what kind of offsets the `pos` and `len` parameters contain.

```python
# default: unit='BYTE'. With len=1, we return a single byte.
tf.strings.substr(thanks, pos=7, len=1).numpy()
```

```
b'\xf0'
```

```python
# Specifying unit='UTF8_CHAR', we return a single character, which in this case
# is 4 bytes.
print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())
```

```
b'\xf0\x9f\x98\x8a'
```
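The same distinction exists in plain Python, where slicing `bytes` counts bytes and slicing `str` counts code points. This sketch uses illustrative variable names and is meant only as an analogy for the two `unit` settings.

```python
thanks = "Thanks 😊".encode("utf-8")

# Byte-based slicing: position 7, length 1 -> the first byte of the emoji only.
one_byte = thanks[7:8]

# Character-based slicing: decode first, slice by code point, then re-encode.
one_char = thanks.decode("utf-8")[7:8].encode("utf-8")

print(one_byte)
print(one_char)
```

The single byte on its own is not valid UTF-8, which is why byte offsets need care when the text contains multi-byte characters.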

### Split Unicode strings

The [`tf.strings.unicode_split`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_split) operation splits a Unicode string into substrings of individual characters:

```python
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
```

```
array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)
```
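For comparison, a plain-Python sketch of the same split: iterate over the decoded characters and re-encode each one separately. The variable names are illustrative.

```python
text = "Thanks 😊"

# One UTF-8 encoded byte string per character.
chars = [ch.encode("utf-8") for ch in text]
print(chars)

# Concatenating the pieces restores the original encoded string.
print(b"".join(chars) == text.encode("utf-8"))
```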

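### Byte offsets for characters

To align the character tensor generated by [`tf.strings.unicode_decode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_decode) with the original string, it helps to know the offset at which each character begins. The method [`tf.strings.unicode_decode_with_offsets`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_decode_with_offsets) is similar to `unicode_decode`, except that it returns a second tensor containing the start offset of each character.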

```python
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print("At byte offset {}: codepoint {}".format(offset, codepoint))
```

```
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
```
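The offsets are simply a running total of each character's encoded size. The sketch below reproduces that bookkeeping in plain Python for UTF-8 input; `decode_with_offsets` is a hypothetical helper written for illustration, not the TensorFlow op.

```python
def decode_with_offsets(data):
    """Return (code_points, byte_offsets) for a UTF-8 encoded byte string."""
    code_points, offsets = [], []
    offset = 0
    for ch in data.decode("utf-8"):
        code_points.append(ord(ch))
        offsets.append(offset)
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))
    return code_points, offsets


cps, offs = decode_with_offsets("🎈🎉🎊".encode("utf-8"))
for cp, off in zip(cps, offs):
    print("At byte offset {}: codepoint {}".format(off, cp))

# Mixed widths: 'a' is 1 byte, 'é' is 2 bytes, the emoji is 4 bytes.
mixed_cps, mixed_offs = decode_with_offsets("a\u00e9😊".encode("utf-8"))
print(mixed_offs)
```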

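## Unicode scripts

Each Unicode code point belongs to a single collection of code points known as a [script](https://en.wikipedia.org/wiki/Script_%28Unicode%29). A character's script helps determine which language the character might belong to. For example, knowing that 'Б' is in the Cyrillic script indicates that modern text containing that character is likely from a Slavic language such as Russian or Ukrainian.

TensorFlow provides the [`tf.strings.unicode_script`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_script) operation to determine which script a given code point uses. The script codes are `int32` values corresponding to [International Components for Unicode](http://site.icu-project.org/home) (ICU) [`UScriptCode`](http://icu-project.org/apiref/icu4c/uscript_8h.html) values.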

```python
uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']

print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]
```

```
[17 8]
```

The [`tf.strings.unicode_script`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_script) operation can also be applied to a multidimensional [`tf.Tensor`](https://tensorflow.google.cn/api_docs/python/tf/Tensor) or [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) of code points:

```python
print(tf.strings.unicode_script(batch_chars_ragged))
```

```
<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>
```
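Raw script codes are hard to read, so a small lookup table can help when inspecting results. The sketch below covers only the codes that appear in this tutorial's outputs. The values 17 (Han), 8 (Cyrillic) and 0 (Common) are stated in the text; the names for 20 (Hiragana) and 25 (Latin) are taken from the ICU `UScriptCode` enum and are an addition here. `USCRIPT_NAMES` and `script_names` are hypothetical helpers, not part of TensorFlow.

```python
# Partial mapping from ICU UScriptCode values to names; deliberately incomplete.
USCRIPT_NAMES = {
    0: "COMMON",
    8: "CYRILLIC",
    17: "HAN",
    20: "HIRAGANA",
    25: "LATIN",
}


def script_names(codes):
    """Translate a list of script codes into readable names."""
    return [USCRIPT_NAMES.get(code, "OTHER({})".format(code)) for code in codes]


print(script_names([17, 8]))
print(script_names([25, 25, 25, 25, 0, 25, 25]))
```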

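## Example: Simple segmentation

Segmentation is the task of splitting text into word-like units. This is often easy when space characters separate the words, but some languages (such as Chinese and Japanese) do not use spaces, and some languages (such as German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" (New York Stock Exchange).

We can perform very rough segmentation, without implementing any ML models, by using changes in script to approximate word boundaries. This works for strings like the "NY株価" example above. It also works for most languages that use spaces, because the space characters of the various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.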

```python
# dtype: string; shape: [num_sentences]
#
# The sentences to process. Edit this line to try out different inputs!
sentence_texts = [u'Hello, world.', u'世界こんにちは']
```

First, we decode the sentences into character code points and look up the script identifier for each character.

```python
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_script[i, j] is the unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)
```

```
<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>
```

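Next, we use those script identifiers to determine where word boundaries should be added. We add a word boundary at the beginning of each sentence, and at every character whose script differs from that of the previous character.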

```python
# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)

# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
```

```
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
```
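The boundary rule described above (start of each sentence, plus every change of script) can be spelled out with plain loops. This is a sketch for intuition only: `word_start_indices` is a hypothetical helper, and its input is the nested list of script codes printed earlier.

```python
def word_start_indices(sentence_scripts):
    """Return flattened character indices at which a new word begins."""
    starts = []
    base = 0  # index of the current sentence's first character, flattened
    for scripts in sentence_scripts:
        for j, code in enumerate(scripts):
            # A word starts at the sentence start or when the script changes.
            if j == 0 or code != scripts[j - 1]:
                starts.append(base + j)
        base += len(scripts)
    return starts


scripts = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],  # 'Hello, world.'
    [17, 17, 20, 20, 20, 20, 20],                        # '世界こんにちは'
]
starts = word_start_indices(scripts)
print(starts)
```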

We can then use those start offsets to build a `RaggedTensor` containing the list of words from all batches:

```python
# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
```

```
<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
```
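Building rows from start offsets amounts to slicing the flat list between consecutive starts. A minimal plain-Python sketch, where `split_by_row_starts` is a hypothetical stand-in for the idea and not the TensorFlow method:

```python
def split_by_row_starts(values, row_starts):
    """Slice a flat list into rows; each row runs up to the next row's start."""
    row_ends = list(row_starts[1:]) + [len(values)]
    return [values[start:end] for start, end in zip(row_starts, row_ends)]


# Flattened code points of both sentences, in order.
flat = [ord(ch) for ch in "Hello, world." + "世界こんにちは"]
words = split_by_row_starts(flat, [0, 5, 7, 12, 13, 15])
word_strings = ["".join(chr(cp) for cp in word) for word in words]
print(word_strings)
```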

Finally, we can segment the word code points `RaggedTensor` back into sentences:

```python
# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)

# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)
```

```
<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>
```
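Grouping by row lengths is the counterpart of slicing by row starts: consume a fixed number of items per row. The sketch below uses a hypothetical helper, `group_by_row_lengths`, on the word list from this example, with four words in the first sentence and two in the second.

```python
def group_by_row_lengths(values, row_lengths):
    """Partition a flat list into consecutive rows of the given lengths."""
    rows, start = [], 0
    for length in row_lengths:
        rows.append(values[start:start + length])
        start += length
    return rows


words = ["Hello", ", ", "world", ".", "世界", "こんにちは"]
sentences = group_by_row_lengths(words, [4, 2])
print(sentences)
```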

To make the final result easier to read, we can encode it back into UTF-8 strings:

```python
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()
```

```
[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
```
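For a quick sanity check outside TensorFlow, the whole idea (group consecutive characters that share a script) fits in a few lines of standard-library Python. Note the big assumption: the standard library has no script lookup, so `rough_script` below is a crude, hypothetical heuristic that takes the first word of each character's Unicode name and treats everything unrecognised as "COMMON". It is not equivalent to `tf.strings.unicode_script` and will misclassify many characters.

```python
import itertools
import unicodedata

# Name prefixes treated as scripts by this heuristic; deliberately incomplete.
KNOWN_PREFIXES = {"LATIN", "CYRILLIC", "GREEK", "CJK", "HIRAGANA", "KATAKANA", "HANGUL"}


def rough_script(ch):
    """Guess a script label from the first word of the character's Unicode name."""
    first_word = unicodedata.name(ch, "").split(" ")[0]
    return first_word if first_word in KNOWN_PREFIXES else "COMMON"


def segment(sentence):
    """Split a string wherever the (guessed) script changes."""
    return ["".join(group)
            for _, group in itertools.groupby(sentence, key=rough_script)]


print(segment("Hello, world."))
print(segment("世界こんにちは"))
print(segment("NY株価"))
```

On these inputs the heuristic yields the same word boundaries as the TensorFlow pipeline above, but that agreement should not be expected in general.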
