
Trainer resume from checkpoint: the learning rate is not the same as when retraining; the learning rate is discontinuous #34053

@LBJ6666


System Info

  • Platform: Windows-10
  • transformers version: 4.43.4
  • Python version: 3.10.11
  • PyTorch version (GPU?): 2.3.1+cu121

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When the Trainer is configured with no warmup and the lr_scheduler set to linear, and training is resumed from an interruption to complete all steps, the learning rates differ from those of a run that trains all steps from the beginning.
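For reference, here is a minimal sketch of the setup described above (the model, dataset, and output_dir are placeholders I've assumed, not taken from the original run):

```python
from transformers import Trainer, TrainingArguments

# Sketch of the reported configuration: linear decay, no warmup,
# 10 total steps, with a checkpoint and an LR log entry every step.
# `model` and `train_dataset` are assumed to be defined elsewhere.
args = TrainingArguments(
    output_dir="out",               # hypothetical path
    max_steps=10,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=0,                 # no warmup
    save_steps=1,                   # save a checkpoint at every step
    logging_steps=1,                # log the learning rate at every step
    per_device_train_batch_size=1,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```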

Learning rates for training from the beginning for each step:

  • Step 1: "learning_rate": 1e-05,
  • Step 2: "learning_rate": 1e-05,
  • Step 3: "learning_rate": 9e-06,
  • Step 4: "learning_rate": 8.000000000000001e-06,
  • Step 5: "learning_rate": 7e-06,
  • Step 6: "learning_rate": 6e-06,
  • Step 7: "learning_rate": 5e-06,
  • Step 8: "learning_rate": 4.000000000000001e-06,
  • Step 9: "learning_rate": 3e-06,
  • Step 10: "learning_rate": 2e-06.

If training is continued from a checkpoint at step 5, the learning rates for each step are:

  • Step 6: "learning_rate": 7e-06,
  • Step 7: "learning_rate": 7e-06,
  • Step 8: "learning_rate": 6e-06,
  • Step 9: "learning_rate": 5e-06,
  • Step 10: "learning_rate": 4.000000000000001e-06.

Why do the learning rates at steps 6 and 7 differ when training is resumed from a checkpoint, compared to training from the start?

Reproduction steps:

  1. Train from the beginning for 10 steps, save a checkpoint at every step, and record the learning rate at each step.
  2. Delete the checkpoints for steps 6 through 10 from the output folder.
  3. Run trainer.train(resume_from_checkpoint=True) to continue training from step 5, and after training completes, record the learning rate at each step from the new checkpoints (see the sketch after this list).
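The resume step, assuming the same `trainer` as in the sketch above:

```python
# With checkpoint-6 through checkpoint-10 deleted from output_dir,
# resume_from_checkpoint=True picks up the latest remaining checkpoint
# (checkpoint-5) and continues training through step 10.
trainer.train(resume_from_checkpoint=True)
```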

Expected behavior

Please explain why the learning rate after resuming is not continuous with the schedule from training from the beginning, which would give, for example:
Step 6: "learning_rate": 6e-06,
Step 7: "learning_rate": 5e-06,
and so on.
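For comparison, this sketch reproduces the linear schedule in isolation with get_linear_schedule_with_warmup (the dummy parameter exists only so an optimizer can be constructed). The Trainer's logged values appear shifted by one step relative to the raw schedule, but either way a correctly resumed run should continue the same strictly decreasing sequence:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Dummy parameter so an optimizer can be constructed.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=10
)

for step in range(1, 11):
    lr = scheduler.get_last_lr()[0]  # LR in effect for this step
    print(f"step {step}: lr={lr}")
    optimizer.step()
    scheduler.step()
# Prints 1e-05, 9e-06, 8e-06, ..., 1e-06:
# the learning rate decreases by 1e-06 at every step, with no repeats.
```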
