System Info
- `transformers` version: 4.45.0
- Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
- Python version: 3.10.15
- Huggingface_hub version: 0.26.2
- Safetensors version: 0.4.5
- Accelerate version: 1.0.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0+rocm6.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: AMD Instinct MI250X/MI250
Who can help?
@ArthurZucker and @itazap
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When I load and then save the tokenizer for OLMo models, the original and saved tokenizer.json files differ, particularly in the merges key.
The code to reproduce this is:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-0724-hf")
tokenizer.save_pretrained("saved_tokenizer")
```
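To see the difference, a minimal sketch (assuming Hub access and the `saved_tokenizer` directory from the snippet above) that loads both tokenizer.json files and inspects the merges key:

```python
import json
from huggingface_hub import hf_hub_download

# Download the original tokenizer.json from the Hub for comparison.
original_path = hf_hub_download("allenai/OLMo-1B-0724-hf", "tokenizer.json")

with open(original_path) as f:
    original = json.load(f)
with open("saved_tokenizer/tokenizer.json") as f:
    saved = json.load(f)

# Compare the BPE merges, where the two files diverge.
print(original["model"]["merges"][:3])
print(saved["model"]["merges"][:3])
print(original["model"]["merges"] == saved["model"]["merges"])
```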
Expected behavior
The original tokenizer.json and the saved tokenizer.json should be identical.
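A hedged way to express this expectation as a check (paths assumed from the reproduction above):

```python
import filecmp
from huggingface_hub import hf_hub_download

original_path = hf_hub_download("allenai/OLMo-1B-0724-hf", "tokenizer.json")
# shallow=False compares file contents byte for byte.
same = filecmp.cmp(original_path, "saved_tokenizer/tokenizer.json", shallow=False)
print("tokenizer.json round-trips unchanged:", same)  # expected: True
```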
