description: Tokenizes a tensor of UTF-8 string tokens into subword pieces.

text.WordpieceTokenizer

View source

Tokenizes a tensor of UTF-8 string tokens into subword pieces.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter, Detokenizer

text.WordpieceTokenizer(
    vocab_lookup_table,
    suffix_indicator='##',
    max_bytes_per_word=100,
    max_chars_per_token=None,
    token_out_type=dtypes.int64,
    unknown_token='[UNK]',
    split_unknown_characters=False
)

Each UTF-8 string token in the input is split into its corresponding wordpieces, drawing from the vocabulary given by vocab_lookup_table (a lookup table, or a vocabulary file with one wordpiece per line).

Algorithm summary: For each token, the longest token prefix that is in the vocabulary is split off. Any part of the token that remains is prefixed using the suffix_indicator, and the process of removing the longest token prefix continues. The unknown_token (UNK) is used when what remains of the token is not in the vocabulary, or if the token is too long.
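
The following pure-Python sketch illustrates this greedy longest-match-first loop. It is illustrative only, not the library's implementation: it ignores max_bytes_per_word, max_chars_per_token, and split_unknown_characters, and assumes `vocab` is a plain set of wordpiece strings.

def wordpiece_split(token, vocab, suffix_indicator='##', unknown_token='[UNK]'):
    """Greedily split `token` into the longest prefixes found in `vocab`."""
    pieces, start = [], 0
    while start < len(token):
        # Find the longest substring starting at `start` that is in the vocab.
        end = len(token)
        while end > start:
            piece = token[start:end]
            if start > 0:
                # Non-initial pieces are looked up with the suffix indicator prepended.
                piece = suffix_indicator + piece
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No prefix of the remainder is in the vocabulary: emit the unknown token.
            return [unknown_token]
        pieces.append(piece)
        start = end
    return pieces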

When token_out_type is tf.string, the output tensor contains strings in the vocabulary (or UNK). When it is an integer type, the output tensor contains indices into the vocabulary list (with UNK being after the last entry).

Example:

>>> import pathlib
>>> pathlib.Path('/tmp/tok_vocab.txt').write_text(
...   "they ##' ##re the great ##est".replace(' ', '\n'))
>>> tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
...   token_out_type=tf.string)
>>> tokenizer.tokenize(["they're", "the", "greatest"])
<tf.RaggedTensor [[b'they', b"##'", b'##re'], [b'the'], [b'great', b'##est']]>
>>> tokenizer.tokenize(["they", "are", "great"])
<tf.RaggedTensor [[b'they'], [b'[UNK]'], [b'great']]>
>>> int_tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
...   token_out_type=tf.int32)
>>> int_tokenizer.tokenize(["the", "greatest"])
<tf.RaggedTensor [[3], [4, 5]]>
>>> int_tokenizer.tokenize(["really", "the", "greatest"])
<tf.RaggedTensor [[6], [3], [4, 5]]>

Tensor or ragged tensor inputs result in ragged tensor outputs. Scalar inputs (which are just a single token) result in tensor outputs.

>>> tokenizer.tokenize("they're")
<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'they', b"##'", b'##re'],
dtype=object)>
>>> tokenizer.tokenize(["they're"])
<tf.RaggedTensor [[b'they', b"##'", b'##re']]>
>>> tokenizer.tokenize(tf.ragged.constant([["they're"]]))
<tf.RaggedTensor [[[b'they', b"##'", b'##re']]]>

Empty strings are tokenized into empty (ragged) tensors.

>>> tokenizer.tokenize([""])
<tf.RaggedTensor [[]]>

Args

`vocab_lookup_table` A lookup table implementing the LookupInterface containing the vocabulary of subwords or a string which is the file path to the vocab.txt file.
`suffix_indicator` (optional) The characters prepended to a wordpiece to indicate that it is a suffix to another subword. Default is '##'.
`max_bytes_per_word` (optional) Max size of input token. Default is 100.
`max_chars_per_token` (optional) Max size of subwords, excluding suffix indicator. If known, providing this improves the efficiency of decoding long words.
`token_out_type` (optional) The type of the token to return. This can be `tf.int64` or `tf.int32` IDs, or `tf.string` subwords. The default is `tf.int64`.
`unknown_token` (optional) The string value to substitute for an unknown token. Default is "[UNK]". If set to `None`, no substitution occurs. If `token_out_type` is `tf.int32`/`tf.int64`, the `vocab_lookup_table` is used (after substitution) to convert the unknown token to an integer.
`split_unknown_characters` (optional) Whether to split out single unknown characters as subtokens. If False (default), words containing unknown characters will be treated as single unknown tokens.
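
As a sketch of passing a lookup table instead of a file path, one way to build such a table with tf.lookup is shown below (assuming the /tmp/tok_vocab.txt file written in the example above; the choice of table type and OOV handling is up to the caller):

>>> init = tf.lookup.TextFileInitializer(
...     '/tmp/tok_vocab.txt',
...     key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
...     value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER)
>>> table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)
>>> table_tokenizer = WordpieceTokenizer(table, token_out_type=tf.int64)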

Methods

detokenize

View source

detokenize(
    token_ids
)

Convert a Tensor or RaggedTensor of wordpiece IDs to string-words.

>>> import pathlib
>>> pathlib.Path('/tmp/detok_vocab.txt').write_text(
...     'a b c ##a ##b ##c'.replace(' ', '\n'))
>>> wordpiece = WordpieceTokenizer('/tmp/detok_vocab.txt')
>>> token_ids = [[0, 4, 5, 2, 5, 5, 5]]
>>> wordpiece.detokenize(token_ids)
<tf.RaggedTensor [[b'abc', b'cccc']]>

The word pieces are joined along the innermost axis to make words. So the result has the same rank as the input, but the innermost axis of the result indexes words instead of word pieces.

The shape transformation is: [..., wordpieces] => [..., words]

When the input shape is [..., words, wordpieces] (like the output of WordpieceTokenizer.tokenize) the result's shape is [..., words, 1]. The additional ragged axis can be removed using words.merge_dims(-2, -1).

Note: This method assumes wordpiece IDs are dense on the interval [0, vocab_size).
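
As a sketch of the round trip described above (assuming the /tmp/tok_vocab.txt file from the first example; outputs omitted, shapes noted in comments):

>>> wp = WordpieceTokenizer('/tmp/tok_vocab.txt')        # token_out_type defaults to tf.int64
>>> ids = wp.tokenize([["they're", "the", "greatest"]])  # shape [batch, words, wordpieces]
>>> words = wp.detokenize(ids)                           # shape [batch, words, 1]
>>> flat = words.merge_dims(-2, -1)                      # shape [batch, words]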

Args
`token_ids` A `RaggedTensor` or `Tensor` with an int dtype. Must have `ndims >= 2`.
Returns
A `RaggedTensor` with dtype `string` and the same rank as the input `token_ids`.

split

View source

split(
    input
)

Alias for Tokenizer.tokenize.

split_with_offsets

View source

split_with_offsets(
    input
)

Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize

View source

tokenize(
    input
)

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example:

>>> import pathlib
>>> pathlib.Path('/tmp/tok_vocab.txt').write_text(
...     "they ##' ##re the great ##est".replace(' ', '\n'))
>>> tokens = [["they're", 'the', 'greatest']]
>>> tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
...                                token_out_type=tf.string)
>>> tokenizer.tokenize(tokens)
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>
Args
`input` An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.
Returns
A `RaggedTensor` of tokens where `tokens[i1...iN, j]` is the string contents (or ID in the vocab_lookup_table representing that string) of the `j`th token in `input[i1...iN]`.

tokenize_with_offsets

View source

tokenize_with_offsets(
    input
)

Tokenizes a tensor of UTF-8 string tokens further into subword tokens.

Example:

>>> import pathlib
>>> pathlib.Path('/tmp/tok_vocab.txt').write_text(
...     "they ##' ##re the great ##est".replace(' ', '\n'))
>>> tokens = [["they're", 'the', 'greatest']]
>>> tokenizer = WordpieceTokenizer('/tmp/tok_vocab.txt',
...                                token_out_type=tf.string)
>>> subtokens, starts, ends = tokenizer.tokenize_with_offsets(tokens)
>>> subtokens
<tf.RaggedTensor [[[b'they', b"##'", b'##re'], [b'the'],
                   [b'great', b'##est']]]>
>>> starts
<tf.RaggedTensor [[[0, 4, 5], [0], [0, 5]]]>
>>> ends
<tf.RaggedTensor [[[4, 5, 7], [3], [5, 8]]]>
Args
`input` An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings.

Returns
A tuple `(tokens, start_offsets, end_offsets)` where:

`tokens[i1...iN, j]` is a `RaggedTensor` of the string contents (or ID in the vocab_lookup_table representing that string) of the `j`th token in `input[i1...iN]`.
`start_offsets[i1...iN, j]` is a `RaggedTensor` of the byte offsets for the inclusive start of the `j`th token in `input[i1...iN]`.
`end_offsets[i1...iN, j]` is a `RaggedTensor` of the byte offsets for the exclusive end of the `j`th token in `input[i1...iN]` (i.e., the first byte after the end of the token).
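
As the example above shows, the offsets index bytes within each input token and do not include the suffix indicator, since '##' never appears in the original text. Slicing the first input token with the start/end pairs above recovers the surface text of each wordpiece (a quick check in plain Python):

>>> token = "they're".encode('utf-8')
>>> [token[s:e] for s, e in zip([0, 4, 5], [4, 5, 7])]
[b'they', b"'", b're']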

vocab_size

View source

vocab_size(
    name=None
)

Returns the vocabulary size.

Args
`name` The name argument that is passed to the op function.
Returns
A scalar representing the vocabulary size.