
(sort of) a bug with token offsets: some special tokens have (0, 0) offsets regardless of their position in the document #35125

@viktor-shcherb

Description

System Info

  • transformers version: 4.46.3
  • Platform: macOS-14.4-arm64-arm-64bit
  • Python version: 3.11.9
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.5
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

xlm_tok = AutoTokenizer.from_pretrained('facebook/xlm-v-base')

xlm_tok('test', return_offsets_mapping=True)

Output:

{'input_ids': [0, 1340, 2], 'attention_mask': [1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (0, 0)]}
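
For reference, converting the ids back to tokens confirms that the (0, 0) offsets belong to the special tokens (a quick check with the same tokenizer; the exact middle piece is illustrative):

print(xlm_tok.convert_ids_to_tokens([0, 1340, 2]))
# e.g. ['<s>', '▁test', '</s>'] -- the first and last are the special tokens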

Expected behavior

The special tokens should have offsets that reflect their position in the document. In this case, the trailing </s> should get (4, 4) instead of (0, 0).
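
Concretely, the output I would expect for the example above (same call, only the last offset differs):

{'input_ids': [0, 1340, 2], 'attention_mask': [1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (4, 4)]}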

Why is it even remotely a big deal? Because it complicates code that iterates over the offsets. For example, this piece of code does not work correctly now:

from transformers import BatchEncoding, PreTrainedTokenizer

def tokenize(example: dict, tokenizer: PreTrainedTokenizer, tokenizer_name: str, max_length: int = 512) -> dict:
    ner_tags: list[int] = example['ner_tags']
    example_words: list[str] = example['tokens']
    text = ' '.join(example_words)
    
    # map words to positions in text
    word_positions: list[int] = example.get('word_positions', [])
    
    if len(word_positions) != len(example_words):
        text_iterator = 0
        for word in example_words:
            while text[text_iterator:text_iterator + len(word)] != word:
                text_iterator += 1
                assert text_iterator < len(text)
            
            word_positions.append(text_iterator)
    
    encoding: BatchEncoding = tokenizer(text, return_offsets_mapping=True, truncation=True, max_length=max_length)
    num_sub_tokens = len(encoding.offset_mapping)
    
    sub_token_iterator = 0
    sub_token_ner_tags: list[int] = []
    for word_id, ner_tag in enumerate(ner_tags):
        word_start = word_positions[word_id]
        word_end = word_start + len(example_words[word_id])
        
        # there may be some empty space between words; the sub tokens that cover this empty space receive the O label.
        # we compare with the end ([1]) to ensure that 0-length tokens are labelled as O (for example <CLS>)
        while sub_token_iterator < num_sub_tokens and encoding.offset_mapping[sub_token_iterator][1] <= word_start:
            sub_token_iterator += 1
            sub_token_ner_tags.append(0)  # 0 = O
            
        # ner_tags_ext (defined elsewhere) maps each B- tag id to its I- counterpart
        ext_tag = ner_tags_ext[ner_tag]
        
        if sub_token_iterator < num_sub_tokens:
            # the first sub token of a word receives original label, the rest receive extended label
            sub_token_ner_tags.append(ner_tag)
            sub_token_iterator += 1
        
        # again, we need to be careful about 0-length tokens, so we compare the start ([0]) with the word end
        while sub_token_iterator < num_sub_tokens and encoding.offset_mapping[sub_token_iterator][0] < word_end:
            sub_token_iterator += 1
            sub_token_ner_tags.append(ext_tag)
    
    # any tokens at the end (like <SEP>) receive the O label
    while sub_token_iterator < num_sub_tokens:
        sub_token_iterator += 1
        sub_token_ner_tags.append(0)
        
    return {
        'word_positions': word_positions,
        f'{tokenizer_name}_sub_tokens': encoding.input_ids,
        f'{tokenizer_name}_sub_token_offsets': encoding.offset_mapping,
        f'{tokenizer_name}_sub_token_ner_tags': sub_token_ner_tags,
    }

With the current offsets, it produces entities like Peter Blackburn</s> for the text "Peter Blackburn". Not a big deal, but annoying to work around and hard to catch.
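
For now, a minimal sketch of a workaround (assuming a fast tokenizer, which offset mapping requires anyway): request the special tokens mask and skip masked positions explicitly instead of relying on their offsets.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('facebook/xlm-v-base')
enc = tok('test', return_offsets_mapping=True, return_special_tokens_mask=True)

# drop special-token positions explicitly instead of trusting their (0, 0) offsets
content_offsets = [
    (start, end)
    for (start, end), is_special in zip(enc['offset_mapping'], enc['special_tokens_mask'])
    if not is_special
]
print(content_offsets)  # [(0, 4)] -- only the real sub-token remains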
