System Info
- `transformers` version: 4.44.2
- Platform: Linux-5.10.220-209.869.amzn2.x86_64-x86_64-with-glibc2.26
- Python version: 3.10.14
- Huggingface_hub version: 0.25.1
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA A10G
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am fine-tuning google/gemma-2-2b; these are the training arguments and the Trainer call:
```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

text_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b", token=token, attn_implementation="eager"
)

training_args = TrainingArguments(
    output_dir=args.log_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    warmup_steps=args.warmup_steps,
    learning_rate=args.learning_rate,
    evaluation_strategy="no",
    logging_dir=args.log_dir,
    logging_steps=50,
    save_strategy="steps",
    save_steps=2000,
    report_to="mlflow",
    run_name=args.run_name,
)

trainer = Trainer(
    model=text_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
```
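For context, Gemma 2 ties `lm_head.weight` to the input embedding matrix, so the two tensors named in the error below genuinely share storage. A minimal sketch (run outside the Trainer, assuming the checkpoint loads with its default `tie_word_embeddings=True` config) to confirm the tying:

```python
from transformers import AutoModelForCausalLM

# Sketch: check that lm_head.weight and embed_tokens.weight point at the same storage.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", attn_implementation="eager")
print(model.lm_head.weight.data_ptr() == model.model.embed_tokens.weight.data_ptr())  # True when tied
```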
I get the following error when the Trainer tries to save the model:

```
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'text_model.model.embed_tokens.weight', 'text_model.lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
```
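The error comes from safetensors itself, which refuses to serialize a state dict in which two entries share the same underlying storage. A standalone sketch (not the Trainer code path) that triggers the same RuntimeError:

```python
import torch
from safetensors.torch import save_file

# Sketch: two state-dict entries backed by the same storage, as with tied embeddings.
embed = torch.nn.Embedding(8, 4)
state = {
    "model.embed_tokens.weight": embed.weight,
    "lm_head.weight": embed.weight,  # tied: same underlying storage
}
save_file(state, "tied.safetensors")  # raises RuntimeError: Some tensors share memory ...
```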
As a workaround, I have currently disabled safetensors serialization through the training arguments: `save_safetensors=False`.
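For reference, a sketch of where that workaround flag sits; with `save_safetensors=False` the Trainer falls back to `torch.save` and writes `pytorch_model.bin` instead of `model.safetensors` (the other argument values here are placeholders):

```python
from transformers import TrainingArguments

# Sketch of the workaround: disable safetensors serialization so checkpoints
# are written with torch.save instead of safetensors.
training_args = TrainingArguments(
    output_dir="output",      # placeholder
    save_strategy="steps",
    save_steps=2000,
    save_safetensors=False,   # avoids the shared-memory RuntimeError
)
```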
Expected behavior
The model should be saved in safetensors format without raising an error.