System Info
- `transformers` version: 4.47.0
- Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
- Python version: 3.12.7
- Huggingface_hub version: 0.26.5
- Safetensors version: 0.4.5
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: N/A
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The space is being stripped from the space-prefixed token `Ġ'` whenever the characters that follow happen to match a common English contraction (e.g., `n't`, `'m`, `'s`, `'ve`), even when stripping is not appropriate. This is caused by `clean_up_tokenization_spaces` being `True` by default for the Llama 3.1 tokenizer.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
original = " plunged the long 'sword' into"
input_ids = tokenizer.encode(original, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(input_ids)
decoded = tokenizer.decode(input_ids)
decoded2 = tokenizer.decode(input_ids, clean_up_tokenization_spaces=False)
print("token ids: ", input_ids)
print("tokens: ", tokens)
print("original: ", original)
print("decoded (default): ", decoded)
print("decoded (clean_up=False):", decoded2)Produces
token ids: [75803, 279, 1317, 364, 80138, 6, 1139]
tokens: ['Ġplunged', 'Ġthe', 'Ġlong', "Ġ'", 'sword', "'", 'Ġinto']
original: plunged the long 'sword' into
decoded (default): plunged the long'sword' into
decoded (clean_up=False): plunged the long 'sword' into
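
For context, the cleanup comes from `PreTrainedTokenizerBase.clean_up_tokenization` in `tokenization_utils_base.py`, which (roughly, reproduced from memory, so the exact replacement list may differ) applies unconditional string replacements with no check on what follows the contraction:

def clean_up_tokenization(out_string):
    # Simplified reproduction of the current helper: each replacement fires
    # regardless of what character follows, so " 'sword" collapses to "'sword".
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )
    return out_string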
Expected behavior
I would expect the original string to round-trip through encode/decode unchanged in all cases, unless it actually contains "traditional" tokenization spacing (e.g., "it 's" as opposed to "it's"). Perhaps a good approach would be to modify the `clean_up_tokenization` function so that it only applies this rule when the common abbreviation is immediately followed by another space; a rough sketch follows.
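
A minimal sketch of that idea (the function name is hypothetical, and I have broadened "followed by a space" slightly to "followed by whitespace, punctuation, or end of string" so that sentence-final contractions are still cleaned up):

import re

CONTRACTIONS = ("n't", "'m", "'s", "'ve", "'re", "'ll", "'d")

def clean_up_tokenization_boundary_aware(out_string):
    # Keep the unambiguous punctuation fixes as they are today.
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
    )
    # Only collapse " 's", " n't", etc. when the contraction is followed by
    # whitespace, punctuation, or the end of the string -- not by more letters.
    pattern = " (" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")(?=\s|[.,!?;:]|$)"
    return re.sub(pattern, r"\1", out_string)

# clean_up_tokenization_boundary_aware(" plunged the long 'sword' into")
#   -> " plunged the long 'sword' into"   (unchanged)
# clean_up_tokenization_boundary_aware("he said he 's leaving")
#   -> "he said he's leaving"

With this check in place, the decoded string in the reproduction above would match the original, while genuinely space-separated contractions would still be joined.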