
Detokenization discrepancy with Llama3.1 #35175

@AbrahamSanders

Description


System Info

  • transformers version: 4.47.0
  • Platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35
  • Python version: 3.12.7
  • Huggingface_hub version: 0.26.5
  • Safetensors version: 0.4.5
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: N/A

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The space is being stripped from the space-prefixed token Ġ' whenever the characters that follow happen to form a common contraction suffix (e.g., n't, 'm, 's, 've), even when that is not appropriate. This happens because clean_up_tokenization_spaces is True by default for the Llama 3.1 tokenizer.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

original = " plunged the long 'sword' into"
input_ids = tokenizer.encode(original, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(input_ids)
decoded = tokenizer.decode(input_ids)
decoded2 = tokenizer.decode(input_ids, clean_up_tokenization_spaces=False)

print("token ids:                ", input_ids)
print("tokens:                   ", tokens)
print("original:                ", original)
print("decoded (default):       ", decoded)
print("decoded (clean_up=False):", decoded2)

Produces

token ids:                 [75803, 279, 1317, 364, 80138, 6, 1139]
tokens:                    ['Ġplunged', 'Ġthe', 'Ġlong', "Ġ'", 'sword', "'", 'Ġinto']
original:                  plunged the long 'sword' into
decoded (default):         plunged the long'sword' into
decoded (clean_up=False):  plunged the long 'sword' into
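
The cleanup step responsible appears to be the fixed-string replacements in PreTrainedTokenizerBase.clean_up_tokenization. The rule list below is an approximation (it may differ slightly between versions), but it reproduces the effect:

def clean_up(out_string: str) -> str:
    # Approximation of the replacements applied when clean_up_tokenization_spaces=True.
    for before, after in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]:
        out_string = out_string.replace(before, after)
    return out_string

print(clean_up(" plunged the long 'sword' into"))
# -> " plunged the long'sword' into"
# The " 's" in " 'sword" matches the "'s" contraction rule, so the space is removed.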

Expected behavior

I would expect the decoded string to match the original string in all cases unless the original actually contains "traditional" tokenization spacing (e.g., it 's vs. it's). Perhaps a good approach would be to modify the clean_up_tokenization function so that it only applies this rule when the common contraction suffix is immediately followed by another space, as sketched below.
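
A rough, untested sketch of that idea: only collapse the space when the contraction suffix is itself followed by whitespace (or the end of the string). Function and rule names here are illustrative, not an actual patch; only the contraction rules are shown modified.

import re

def clean_up_tokenization(out_string: str) -> str:
    # Punctuation rules are unaffected by the proposal.
    for before, after in [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")]:
        out_string = out_string.replace(before, after)
    # Only strip the space before a contraction suffix when the suffix is a
    # standalone token, i.e. followed by whitespace or end of string.
    for suffix in ("n't", "'m", "'s", "'ve", "'re"):
        out_string = re.sub(rf" {re.escape(suffix)}(?=\s|$)", suffix, out_string)
    return out_string

print(clean_up_tokenization("I do n't think it 's here"))       # -> "I don't think it's here"
print(clean_up_tokenization(" plunged the long 'sword' into"))  # left unchanged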
