
v0.5.0

Released by github-actions on 10 Feb, 02:02

What's new

Added 🎉

  • Added a TrainingEngine abstraction to the torch integration.
  • Added a FairScale integration with a FairScaleTrainingEngine
    that leverages FairScale's FullyShardedDataParallel. This is meant to be used within the TorchTrainStep.
  • All PyTorch components (such as learning rate schedulers, optimizers, data collators, etc.) from the
    transformers library are now registered under the corresponding class in the torch integration.
    For example, the transformers Adafactor optimizer is registered as an Optimizer under the name
    "transformers::Adafactor". More details can be found in the documentation for the transformers integration.
    A config sketch showing how these pieces fit together follows this list.

Changed ⚠️

  • Various changes to the parameters of the TorchTrainStep due to the introduction of the TrainingEngine class.
  • Params are now logged at the DEBUG level instead of INFO to reduce noise in logs.
  • The waiting message for FileLock is now clearer about which file it's waiting for.
  • Added an easier way to get the default Tango global config.
  • Most methods of TorchTrainCallback now also take an epoch parameter.
  • WandbTrainCallback now logs peak GPU memory occupied by PyTorch tensors per worker. This is useful because W&B's system metrics only display the total GPU memory reserved by PyTorch, which is always higher than the actual amount of GPU memory occupied by tensors. So these new metrics give a more accurate view into how much memory your training job is actually using.
  • Plain old Python functions can now be used in Lazy objects (see the sketch following this list).
  • LocalWorkspace now creates a symlink to the outputs of the latest run.
  • Tango is now better at guessing when a step has died and should be re-run.
  • Tango is now more lenient about registering the same class under the same name twice.
  • When you use dict instead of Dict in your type annotations, you now get a legible error message. The same goes for List, Tuple, and Set (see the example following this list).
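
To illustrate the Lazy change above, here is a minimal sketch. It assumes Lazy can be imported from
tango.common.lazy and that a plain callable goes through the same construct() method that classes do;
the function and arguments are invented for the example.

```python
from tango.common.lazy import Lazy  # assumed import path


# A plain function, not a FromParams class.
def make_greeting(name: str, punctuation: str = "!") -> str:
    return f"Hello, {name}{punctuation}"


# Lazy now accepts the function directly; the call is deferred until
# construct() is invoked with the remaining arguments.
lazy_greeting = Lazy(make_greeting)
print(lazy_greeting.construct(name="Tango"))  # -> "Hello, Tango!"
```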
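
And a small, hypothetical example of the annotation rule behind the new dict/Dict error message:
FromParams inspects the typing generics, so arguments should be annotated with Dict, List, Tuple,
and Set rather than the bare built-ins.

```python
from typing import Dict, List

from tango import Step


class CountTokens(Step):
    # Use the typing generics (Dict[str, int]), not the built-in `dict`;
    # the bare built-ins now produce a readable error instead of an opaque crash.
    def run(self, token_counts: Dict[str, int]) -> List[str]:
        return [token for token, count in token_counts.items() if count > 0]
```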

Fixed ✅

  • Fixed a bug in Registrable and FromParams where registered function constructors would not properly construct
    arguments that were classes.
  • Fixed a bug in FromParams that would cause a crash when an argument to the constructor had the name params.
  • Made FromParams more efficient by only trying to parse the params as a Step when it looks like it actually could be a step.
  • Fixed a bug where the Executor would crash if the git command could not be found.
  • Fixed a bug where validation settings were not interpreted correctly by the torch trainer.
  • When you register the same name twice using Registrable, you get an error message. That error message now contains the correct class name.

Commits

a39a69f Merge pull request #161 from allenai/FromParamsDuJour
3063a92 CHANGELOG quick fix
cd006ae Add TrainEngine abstraction to TorchTrainStep, add FairScale integration, improve transformers integration (#77)
93438eb Update setuptools requirement from <=59.5.0 to <60.8.0 (#170)
e57dd91 Bump sphinx-copybutton from 0.4.0 to 0.5.0 (#174)
a8b1bdc split Docker build into seperate workflow, only run when necessary (#178)
59c91f7 make install comments work on all shells (#179)
a059416 Merge pull request #160 from allenai/GuessStepDirBetter
de7195d more fixes for conda-forge (#177)
75e9d42 use conda in Docker image, multi-stage build (#172)
611e446 Merge pull request #176 from allenai/latest-outputs
7241d20 Merge pull request #175 from allenai/self-contained-tests
83aa692 Merge pull request #153 from allenai/LazyWithoutFromParams
893e601 use virtualenv within Docker (#167)
178b8bd Merge pull request #171 from allenai/LenientRegister
6c765c8 Merge pull request #169 from allenai/InformativeFileLock
91ff7ac Merge pull request #168 from allenai/DefaultGlobalConfig
5d602fb push Docker images to GHCR.io (#166)
2b26fc8 set 'resume' to 'allow' instead of 'auto' (#155)
26771e7 fix bug when git missing (#163)
9009119 Add Dockerfile (#162)
a02155d Add a required flag to the README for gpt2-example (#159)