
Saving model in safetensors format through Trainer fails for Gemma 2 due to shared tensors #33807

@oranshayer

Description

System Info

  • transformers version: 4.44.2
  • Platform: Linux-5.10.220-209.869.amzn2.x86_64-x86_64-with-glibc2.26
  • Python version: 3.10.14
  • Huggingface_hub version: 0.25.1
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A10G

Who can help?

@muellerz @SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am fine-tuning google/gemma-2-2b; these are the training arguments and the Trainer call:

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

text_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b", token=token, attn_implementation="eager"
)

training_args = TrainingArguments(
    output_dir=args.log_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    warmup_steps=args.warmup_steps,
    learning_rate=args.learning_rate,
    evaluation_strategy="no",
    logging_dir=args.log_dir,
    logging_steps=50,
    save_strategy="steps",
    save_steps=2000,
    report_to="mlflow",
    run_name=args.run_name,
)

trainer = Trainer(
    model=text_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

I get the following error when the Trainer tries to save the model:

RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'text_model.model.embed_tokens.weight', 'text_model.lm_head.weight'}].
            A potential way to correctly save your model is to use `save_model`.
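For context, Gemma 2 ties lm_head.weight to model.embed_tokens.weight (tie_word_embeddings=True), and safetensors refuses to serialize two state-dict entries backed by the same storage. The text_model. prefix in the error names suggests the module handed to the Trainer wraps the Gemma model, which can interfere with transformers' usual tied-weight deduplication. A minimal sketch to confirm the tying (the data_ptr comparison is illustrative, not from the original report):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

# Tied weights alias one storage; two state-dict names pointing at the
# same storage is exactly what the safetensors check rejects.
print(model.config.tie_word_embeddings)  # expected: True for Gemma 2
print(model.lm_head.weight.data_ptr() == model.model.embed_tokens.weight.data_ptr())  # expected: True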

As a workaround, I have disabled safetensors saving through the training arguments:

save_safetensors=False,
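As an alternative to disabling safetensors entirely, the helper the error message points at can serialize models with shared tensors by writing each shared storage only once. A hedged sketch of saving the final checkpoint manually this way (the output filename is a placeholder, and this bypasses the Trainer's own checkpointing):

from safetensors.torch import save_model

# save_model deduplicates shared tensors before writing, so the tied
# embed_tokens / lm_head pair is stored a single time instead of raising.
save_model(trainer.model, "final_model.safetensors")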

Expected behavior

The Trainer should save the model in safetensors format without raising an error.
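For comparison, saving the bare Gemma model directly does not hit this error, since save_pretrained drops the aliases listed in the model's _tied_weights_keys before handing the state dict to safetensors; a minimal sketch, assuming the unwrapped text_model from the reproduction above (the output directory is a placeholder):

# Saving the unwrapped model succeeds: transformers removes the tied
# lm_head alias, so safetensors only sees embed_tokens once.
text_model.save_pretrained("gemma2-checkpoint", safe_serialization=True)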
