
Detokenization discrepancy with Llama3.1 #35175

@AbrahamSanders

Description


System Info

  • transformers version: 4.47.0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.12.7
  • Huggingface_hub version: 0.26.5
  • Safetensors version: 0.4.5
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: N/A

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The space is being stripped from the space-prefixed token Ġ' whenever the characters that follow happen to form a common contraction suffix (e.g., n't, 'm, 's, 've), even when that is not appropriate. This happens because clean_up_tokenization_spaces is True by default for the Llama 3.1 tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

original = " plunged the long 'sword' into"
input_ids = tokenizer.encode(original, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(input_ids)
decoded = tokenizer.decode(input_ids)
decoded2 = tokenizer.decode(input_ids, clean_up_tokenization_spaces=False)

print("token ids:                ", input_ids)
print("tokens:                   ", tokens)
print("original:                ", original)
print("decoded (default):       ", decoded)
print("decoded (clean_up=False):", decoded2)

Produces

token ids:                 [75803, 279, 1317, 364, 80138, 6, 1139]
tokens:                    ['Ġplunged', 'Ġthe', 'Ġlong', "Ġ'", 'sword', "'", 'Ġinto']
original:                  plunged the long 'sword' into
decoded (default):         plunged the long'sword' into
decoded (clean_up=False):  plunged the long 'sword' into
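
The cleanup step responsible appears to be the fixed-string replacements in PreTrainedTokenizerBase.clean_up_tokenization. The rule list below is an approximation (it may differ slightly between versions), but it reproduces the effect:

def clean_up(out_string: str) -> str:
    # Approximation of the replacements applied when clean_up_tokenization_spaces=True.
    for before, after in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]:
        out_string = out_string.replace(before, after)
    return out_string

print(clean_up(" plunged the long 'sword' into"))
# -> " plunged the long'sword' into"
# The " 's" in " 'sword" matches the "'s" contraction rule, so the space is removed.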

Expected behavior

I would expect the decoded string to match the original string in all cases unless the original actually contains "traditional" tokenization spacing (e.g., it 's vs. it's). Perhaps a good approach would be to modify the clean_up_tokenization function so that it only applies this rule when the common contraction suffix is immediately followed by another space, as sketched below.
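
A rough, untested sketch of that idea: only collapse the space when the contraction suffix is itself followed by whitespace (or the end of the string). Function and rule names here are illustrative, not an actual patch; only the contraction rules are shown modified.

import re

def clean_up_tokenization(out_string: str) -> str:
    # Punctuation rules are unaffected by the proposal.
    for before, after in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")]:
        out_string = out_string.replace(before, after)
    # Only strip the space before a contraction suffix when the suffix is a
    # standalone token, i.e. followed by whitespace or end of string.
    for suffix in ("n't", "'m", "'s", "'ve", "'re"):
        out_string = re.sub(rf" {re.escape(suffix)}(?=\s|$)", suffix, out_string)
    return out_string

print(clean_up_tokenization("I do n't think it 's here"))       # -> "I don't think it's here"
print(clean_up_tokenization(" plunged the long 'sword' into"))  # left unchanged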
