
(sort of) a bug with token offsets: some special tokens have (0, 0) offsets regardless of their position in the document #35125

@viktor-shcherb

Description

System Info

  • transformers version: 4.46.3
  • Platform: macOS-14.4-arm64-arm-64bit
  • Python version: 3.11.9
  • Huggingface_hub version: 0.26.3
  • Safetensors version: 0.4.5
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

xlm_tok = AutoTokenizer.from_pretrained('facebook/xlm-v-base')

xlm_tok('test', return_offsets_mapping=True)

Output:

{'input_ids': [0, 1340, 2], 'attention_mask': [1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (0, 0)]}
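
For reference, converting the ids back to tokens confirms that the (0, 0) offsets belong to the special tokens (a quick check with the same tokenizer; the exact middle piece is illustrative):

print(xlm_tok.convert_ids_to_tokens([0, 1340, 2]))
# e.g. ['<s>', '▁test', '</s>'] -- the first and last are the special tokens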

Expected behavior

The special tokens should have offsets that reflect their position in the document. In this case, the trailing </s> should get (4, 4) instead of (0, 0).
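
Concretely, the output I would expect for the example above (same call, only the last offset differs):

{'input_ids': [0, 1340, 2], 'attention_mask': [1, 1, 1], 'offset_mapping': [(0, 0), (0, 4), (4, 4)]}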

Why is it even remotely a big deal? Because it complicates code that iterates over the offsets. For example, this piece of code does not work correctly now:

from transformers import BatchEncoding, PreTrainedTokenizer

def tokenize(example: dict, tokenizer: PreTrainedTokenizer, tokenizer_name: str, max_length: int = 512) -> dict:
    ner_tags: list[int] = example['ner_tags']
    example_words: list[str] = example['tokens']
    text = ' '.join(example_words)
    
    # map words to positions in text
    word_positions: list[int] = example.get('word_positions', [])
    
    if len(word_positions) != len(example_words):
        text_iterator = 0
        for word in example_words:
            while text[text_iterator:text_iterator + len(word)] != word:
                text_iterator += 1
                assert text_iterator < len(text)
            
            word_positions.append(text_iterator)
    
    encoding: BatchEncoding = tokenizer(text, return_offsets_mapping=True, truncation=True, max_length=max_length)
    num_sub_tokens = len(encoding.offset_mapping)
    
    sub_token_iterator = 0
    sub_token_ner_tags: list[int] = []
    for word_id, ner_tag in enumerate(ner_tags):
        word_start = word_positions[word_id]
        word_end = word_start + len(example_words[word_id])
        
        # there may be some empty space between words; the sub tokens that cover this empty space receive the O label.
        # we compare with the end ([1]) to ensure that 0-length tokens are labelled as O (for example <CLS>)
        while sub_token_iterator < num_sub_tokens and encoding.offset_mapping[sub_token_iterator][1] <= word_start:
            sub_token_iterator += 1
            sub_token_ner_tags.append(0)  # 0 = O
            
        # ner_tags_ext (defined elsewhere) maps each B- tag id to its I- counterpart
        ext_tag = ner_tags_ext[ner_tag]
        
        if sub_token_iterator < num_sub_tokens:
            # the first sub token of a word receives original label, the rest receive extended label
            sub_token_ner_tags.append(ner_tag)
            sub_token_iterator += 1
        
        # again, we need to be careful about 0-length tokens, so we compare the start ([0]) with the word end
        while sub_token_iterator < num_sub_tokens and encoding.offset_mapping[sub_token_iterator][0] < word_end:
            sub_token_iterator += 1
            sub_token_ner_tags.append(ext_tag)
    
    # any tokens at the end (like <SEP>) receive the O label
    while sub_token_iterator < num_sub_tokens:
        sub_token_iterator += 1
        sub_token_ner_tags.append(0)
        
    return {
        'word_positions': word_positions,
        f'{tokenizer_name}_sub_tokens': encoding.input_ids,
        f'{tokenizer_name}_sub_token_offsets': encoding.offset_mapping,
        f'{tokenizer_name}_sub_token_ner_tags': sub_token_ner_tags,
    }

With the current offsets, it produces entities like Peter Blackburn</s> for the text "Peter Blackburn". Not a big deal, but annoying to work around and hard to catch.
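
For now, a minimal sketch of a workaround (assuming a fast tokenizer, which offset mapping requires anyway): request the special tokens mask and skip masked positions explicitly instead of relying on their offsets.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('facebook/xlm-v-base')
enc = tok('test', return_offsets_mapping=True, return_special_tokens_mask=True)

# drop special-token positions explicitly instead of trusting their (0, 0) offsets
content_offsets = [
    (start, end)
    for (start, end), is_special in zip(enc['offset_mapping'], enc['special_tokens_mask'])
    if not is_special
]
print(content_offsets)  # [(0, 4)] -- only the real sub-token remains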
