Tokenizes a tensor of UTF-8 string tokens into subword pieces.
Inherits From: `TokenizerWithOffsets`, `Tokenizer`, `SplitterWithOffsets`, `Splitter`, `Detokenizer`
text.WordpieceTokenizer(
    vocab_lookup_table,
    suffix_indicator='##',
    max_bytes_per_word=100,
    max_chars_per_token=None,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    split_unknown_characters=False
)
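The first argument, `vocab_lookup_table`, can be the path to a vocabulary file, as in the examples below. The parameter name suggests that an explicit lookup table is also accepted; the following is a minimal sketch under that assumption, building a `tf.lookup.StaticVocabularyTable` from the same kind of file:

```python
import pathlib
import tensorflow as tf
import tensorflow_text as text

# Write a small vocabulary file (one wordpiece per line), mirroring the
# examples below.
pathlib.Path('/tmp/tok_vocab.txt').write_text(
    "they ##' ##re the great ##est".replace(' ', '\n'))

# Assumption: a tf.lookup table can be passed instead of the file path.
# Ids come from line numbers; one OOV bucket is reserved for unknown pieces.
vocab_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        '/tmp/tok_vocab.txt',
        key_dtype=tf.string,
        key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64,
        value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)

tokenizer = text.WordpieceTokenizer(vocab_table)
print(tokenizer.tokenize(["they're"]))  # expected wordpiece ids: [[0, 1, 2]]
```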
Each UTF-8 string token in the input is split into its corresponding wordpieces, drawn from the list in the file `vocab_lookup_table`.

Algorithm summary: For each token, the longest prefix of the token that is in the vocabulary is split off. Any part of the token that remains is prefixed with the `suffix_indicator`, and the longest-prefix removal repeats on what is left. The `unknown_token` (UNK) is used when what remains of the token is not in the vocabulary, or when the token is too long (exceeds `max_bytes_per_word`).
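To make the greedy longest-prefix matching concrete, here is a minimal pure-Python sketch of the behavior just described (an illustration only, not the library's implementation; it ignores `max_bytes_per_word` and the other options):

```python
def wordpiece_split(token, vocab, suffix_indicator='##', unknown_token='[UNK]'):
    """Greedily splits `token` into the longest vocabulary prefixes."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        piece = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Nothing left matches the vocabulary: the whole token maps to UNK.
            return [unknown_token]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"they", "##'", "##re", "the", "great", "##est"}
print(wordpiece_split("they're", vocab))  # ['they', "##'", '##re']
print(wordpiece_split("are", vocab))      # ['[UNK]']
```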
When `token_out_type` is `tf.string`, the output tensor contains strings in the vocabulary (or UNK). When it is an integer type, the output tensor contains indices into the vocabulary list (the UNK index comes after the last vocabulary entry).
>>> import pathlib
>>> pathlib.Path('/tmp/tok_vocab.txt').write_text(
... "they ##' ##re the great ##est".replace(' ', '\n'))
>>> tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
... token_out_type=tf.string)
>>> tokenizer.tokenize(["they're", "the", "greatest"])
<tf.RaggedTensor [[b'they', b"##'", b'##re'], [b'the'], [b'great', b'##est']]>
>>> tokenizer.tokenize(["they", "are", "great"])
<tf.RaggedTensor [[b'they'], [b'[UNK]'], [b'great']]>
>>> int_tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
... token_out_type=tf.int32)
>>> int_tokenizer.tokenize(["the", "greatest"])
<tf.RaggedTensor [[3], [4, 5]]>
>>> int_tokenizer.tokenize(["really", "the", "greatest"])
<tf.RaggedTensor [[6], [3], [4, 5]]>
`Tensor` or `RaggedTensor` inputs result in `RaggedTensor` outputs. Scalar inputs (which are just a single token) result in `Tensor` outputs.
>>> tokenizer.tokenize("they're")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'they', b"##'", b'##re'],
dtype=object)>
>>> tokenizer.tokenize(["they're"])
<tf.RaggedTensor [[b'they', b"##'", b'##re']]>
>>> tokenizer.tokenize(tf.ragged.constant([["they're"]]))
<tf.RaggedTensor [[[b'they', b"##'", b'##re']]]>
Empty strings are tokenized into empty (ragged) tensors.
>>> tokenizer.tokenize([""])
<tf.RaggedTensor [[]]>
detokenize(
token_ids
)
Convert a Tensor or RaggedTensor of wordpiece IDs to string-words.
>>> import pathlib
>>> pathlib.Path('/tmp/detok_vocab.txt').write_text(
... 'a b c ##a ##b ##c'.replace(' ', '\n'))
>>> wordpiece = WordpieceTokenizer('/tmp/detok_vocab.txt')
>>> token_ids = [[0, 4, 5, 2, 5, 5, 5]]
>>> wordpiece.detokenize(token_ids)
<tf.RaggedTensor [[b'abc', b'cccc']]>
The word pieces are joined along the innermost axis to make words. So the result has the same rank as the input, but the innermost axis of the result indexes words instead of word pieces.

The shape transformation is: `[..., wordpieces] => [..., words]`.

When the input shape is `[..., words, wordpieces]` (like the output of `WordpieceTokenizer.tokenize`), the result's shape is `[..., words, 1]`. The additional ragged axis can be removed using `words.merge_dims(-2, -1)`.
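For example (a sketch reusing the `wordpiece` tokenizer and vocabulary file from above; the outputs follow from the shape rule just described):
>>> token_ids = tf.ragged.constant([[[0, 4], [2, 5]]])  # [batch, words, wordpieces]
>>> words = wordpiece.detokenize(token_ids)
>>> words
<tf.RaggedTensor [[[b'ab'], [b'cc']]]>
>>> words.merge_dims(-2, -1)
<tf.RaggedTensor [[b'ab', b'cc']]>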
Note: This method assumes wordpiece IDs are dense on the interval [0, vocab_size).
| Args | |
|---|---|
| `token_ids` | A `RaggedTensor` or `Tensor` with an int dtype. Must have `ndims >= 2` |
| Returns | |
|---|---|
| A `RaggedTensor` with dtype `string` and the same rank as the input `token_ids`. |
split(
input
)
Alias for `Tokenizer.tokenize`.
split_with_offsets(
input
)
Alias for `TokenizerWithOffsets.tokenize_with_offsets`.
tokenize(
input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
>>> import pathlib
>>> pathlib.Path('/tmp/tok_vocab.txt').write_text(
... "they ##' ##re the great ##est".replace(' ', '\n'))
>>> tokens = [["they're", 'the', 'greatest']]
>>> tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
... token_out_type=tf.string)
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| Returns | |
|---|---|
| A `RaggedTensor` of tokens where `tokens[i1...iN, j]` is the string contents (or ID in the `vocab_lookup_table` representing that string) of the `jth` token in `input[i1...iN]`. |
tokenize_with_offsets(
input
)
Tokenizes a tensor of UTF-8 string tokens further into subword tokens.
>>> import pathlib
>>> pathlib.Path('/tmp/tok_vocab.txt').write_text(
... "they ##' ##re the great ##est".replace(' ', '\n'))
>>> tokens = [["they're", 'the', 'greatest']]
>>> tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
... token_out_type=tf.string)
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
[b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>
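The offsets are byte offsets into each input string, with `start_offsets` inclusive and `end_offsets` exclusive, so slicing the input with them recovers the raw, un-prefixed text of each wordpiece. A small sketch reusing the `tokenizer` above:
>>> texts = tf.constant(["they're"])
>>> pieces, starts, ends = tokenizer.tokenize_with_offsets(texts)
>>> # One copy of the input string per wordpiece, then slice by byte offsets.
>>> tf.strings.substr(tf.constant(["they're"] * 3), starts[0], ends[0] - starts[0])
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'they', b"'", b're'], dtype=object)>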
| Args | |
|---|---|
| `input` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |
| Returns | |
|---|---|
| A tuple `(tokens, start_offsets, end_offsets)` where: `tokens[i1...iN, j]` is the string contents (or ID in the `vocab_lookup_table` representing that string) of the `jth` token in `input[i1...iN]`; `start_offsets[i1...iN, j]` is the byte offset for the start of the `jth` token in `input[i1...iN]`; and `end_offsets[i1...iN, j]` is the byte offset immediately after the end of the `jth` token in `input[i1...iN]`. |
vocab_size(
name=None
)
Returns the vocabulary size.
| Args | |
|---|---|
| `name` | The name argument that is passed to the op function. |
| Returns | |
|---|---|
| A scalar representing the vocabulary size. |