System Info
- transformers version: 4.46.3
- Platform: macOS-14.4-arm64-arm-64bit
- Python version: 3.11.9
- Huggingface_hub version: 0.26.3
- Safetensors version: 0.4.5
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.5.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer

xlm_tok = AutoTokenizer.from_pretrained('facebook/xlm-v-base')
xlm_tok('test', return_offsets_mapping=True)

Output:
{'input_ids': [0, 1340, 2], 'attention_mask': [1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (0, 0)]}
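For clarity (this pairing is not part of the original report, just a quick diagnostic sketch), zipping the decoded tokens with the returned offsets shows that both (0, 0) entries belong to the special tokens:

enc = xlm_tok('test', return_offsets_mapping=True)
print(list(zip(xlm_tok.convert_ids_to_tokens(enc.input_ids), enc.offset_mapping)))
# roughly: [('<s>', (0, 0)), ('▁test', (0, 4)), ('</s>', (0, 0))]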
Expected behavior
Special tokens should have offsets that reflect their position in the document. In this case, the trailing </s> should get (4, 4) instead of (0, 0).
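Concretely, with the trailing </s> anchored at the end of the 4-character input, the expected output would look like this (hypothetical, for illustration only):

{'input_ids': [0, 1340, 2], 'attention_mask': [1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (4, 4)]}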
Why is it even remotely a big deal? Correct offsets would simplify code that iterates through them. For example, this piece of code does not work correctly now:
from transformers import BatchEncoding, PreTrainedTokenizer

# ner_tags_ext is assumed to be defined elsewhere in the script: a mapping from each
# tag id to its "continuation" tag id (e.g. B-PER -> I-PER).
def tokenize(example: dict, tokenizer: PreTrainedTokenizer, tokenizer_name: str, max_length: int = 512) -> dict:
    ner_tags: list[int] = example['ner_tags']
    example_words: list[str] = example['tokens']
    text = ' '.join(example_words)
    # map words to positions in text
    word_positions: list[int] = example.get('word_positions', [])
    if len(word_positions) != len(example_words):
        text_iterator = 0
        for word in example_words:
            while text[text_iterator:text_iterator + len(word)] != word:
                text_iterator += 1
                assert text_iterator < len(text)
            word_positions.append(text_iterator)
            # advance past the matched word so a repeated word is not matched at the same position again
            text_iterator += len(word)
    encoding: BatchEncoding = tokenizer(text, return_offsets_mapping=True, truncation=True, max_length=max_length)
    num_sub_tokens = len(encoding.offset_mapping)
    sub_token_iterator = 0
    sub_token_ner_tags: list[int] = []
    for word_id, ner_tag in enumerate(ner_tags):
        word_start = word_positions[word_id]
        word_end = word_start + len(example_words[word_id])
        # there may be some empty space between words; the sub tokens that include this empty space receive the O label
        # we compare with the end ([1]) to ensure that 0-length tokens are labelled as O (for example <CLS>)
        while sub_token_iterator < num_sub_tokens and encoding.offset_mapping[sub_token_iterator][1] <= word_start:
            sub_token_iterator += 1
            sub_token_ner_tags.append(0)  # 0 = O
        ext_tag = ner_tags_ext[ner_tag]
        if sub_token_iterator < num_sub_tokens:
            # the first sub token of a word receives the original label, the rest receive the extended label
            sub_token_ner_tags.append(ner_tag)
            sub_token_iterator += 1
            # again, we need to be careful about 0-length tokens, so we compare the start ([0]) with the word end
            while sub_token_iterator < num_sub_tokens and encoding.offset_mapping[sub_token_iterator][0] < word_end:
                sub_token_iterator += 1
                sub_token_ner_tags.append(ext_tag)
    # any tokens at the end (like <SEP>) receive the O label
    while sub_token_iterator < num_sub_tokens:
        sub_token_iterator += 1
        sub_token_ner_tags.append(0)
    return {
        'word_positions': word_positions,
        f'{tokenizer_name}_sub_tokens': encoding.input_ids,
        f'{tokenizer_name}_sub_token_offsets': encoding.offset_mapping,
        f'{tokenizer_name}_sub_token_ner_tags': sub_token_ner_tags,
    }

It produces entities like Peter Blackburn</s> for the text "Peter Blackburn". Not a big deal, but annoying to work around and hard to catch.
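A possible workaround (a sketch of my own, not part of the original pipeline): ask the tokenizer for a special tokens mask and use it, rather than the offsets, to decide which positions are special tokens before running the labelling loop above.

from transformers import AutoTokenizer

# Workaround sketch, assuming the fast 'facebook/xlm-v-base' tokenizer from the report.
# return_special_tokens_mask=True marks special tokens explicitly, so the labelling
# logic no longer has to infer them from their (0, 0) offsets.
tok = AutoTokenizer.from_pretrained('facebook/xlm-v-base')
enc = tok('Peter Blackburn', return_offsets_mapping=True, return_special_tokens_mask=True)
content_offsets = [
    offset
    for offset, is_special in zip(enc.offset_mapping, enc.special_tokens_mask)
    if not is_special
]
# content_offsets now only covers real text, so a trailing </s> can no longer be
# glued onto the last entity span.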