v0.5.0
What's new
Added 🎉
- Added a `TrainingEngine` abstraction to the torch integration.
- Added a FairScale integration with a `FairScaleTrainingEngine` that leverages FairScale's `FullyShardedDataParallel`. This is meant to be used within the `TorchTrainStep`.
- All PyTorch components (such as learning rate schedulers, optimizers, data collators, etc.) from the transformers library are now registered under the corresponding class in the torch integration. For example, the transformers `Adafactor` optimizer is registered as an `Optimizer` under the name `"transformers::Adafactor"` (see the example after this list). More details can be found in the documentation for the transformers integration.
Changed ⚠️
- Various changes to the parameters of the `TorchTrainStep` due to the introduction of the `TrainingEngine` class.
- Params are now logged at the `DEBUG` level instead of `INFO` to reduce noise in the logs.
- The waiting message for `FileLock` is now clear about which file it's waiting for.
- Added an easier way to get the default Tango global config.
- Most methods of `TorchTrainCallback` now also take an `epoch` parameter (see the first example after this list).
- `WandbTrainCallback` now logs peak GPU memory occupied by PyTorch tensors per worker. This is useful because W&B's system metrics only display the total GPU memory reserved by PyTorch, which is always higher than the actual amount of GPU memory occupied by tensors, so these new metrics give a more accurate view of how much memory your training job is actually using.
- Plain old Python functions can now be used in `Lazy` objects (see the second example after this list).
- `LocalWorkspace` now creates a symlink to the outputs of the latest run.
- Tango is now better at guessing when a step has died and should be re-run.
- Tango is now more lenient about registering the same class under the same name twice.
- When you use `dict` instead of `Dict` in your type annotations, you now get a legible error message. The same goes for `List`, `Tuple`, and `Set` (see the third example after this list).
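A rough sketch of a `TorchTrainCallback` that uses the new `epoch` parameter. The method name and exact signature below are assumptions for illustration; check the torch integration documentation for the real API:

```python
from tango.integrations.torch import TorchTrainCallback

class EpochLossLogger(TorchTrainCallback):
    # Hypothetical hook signature: most callback methods now receive
    # `epoch` alongside the global `step`.
    def post_batch(self, step: int, epoch: int, batch_loss: float) -> None:
        print(f"epoch={epoch} step={step} loss={batch_loss:.4f}")
```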
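A sketch of wrapping a plain function in `Lazy`, assuming `Lazy.construct` forwards keyword arguments to the wrapped callable:

```python
from tango.common import Lazy

def build_vocab(min_count: int = 1) -> dict:
    # A plain function now works where a FromParams class used to be required.
    return {"min_count": min_count}

lazy_vocab = Lazy(build_vocab)
vocab = lazy_vocab.construct(min_count=5)
```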
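And a sketch of the type-annotation rule, using a hypothetical `FromParams` class: annotate with the `typing` generics so the contained types can be introspected, since a bare builtin `dict` now produces a legible error instead of an obscure one:

```python
from typing import Dict

from tango.common import FromParams

class FeatureScaler(FromParams):
    # Annotating with typing.Dict (not the bare builtin `dict`) lets
    # FromParams construct this argument from params.
    def __init__(self, factors: Dict[str, float]) -> None:
        self.factors = factors
```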
Fixed ✅
- Fixed a bug in `Registrable` and `FromParams` where registered function constructors would not properly construct arguments that were classes.
- Fixed a bug in `FromParams` that would cause a crash when an argument to the constructor had the name `params`.
- Made `FromParams` more efficient by only trying to parse the params as a `Step` when it looks like it actually could be a step.
- Fixed a bug where the `Executor` would crash if the `git` command could not be found.
- Fixed a bug where validation settings were not interpreted the right way by the torch trainer.
- When you register the same name twice using `Registrable`, you get an error message. That error message now contains the correct class name.
Commits
a39a69f Merge pull request #161 from allenai/FromParamsDuJour
3063a92 CHANGELOG quick fix
cd006ae Add TrainEngine abstraction to TorchTrainStep, add FairScale integration, improve transformers integration (#77)
93438eb Update setuptools requirement from <=59.5.0 to <60.8.0 (#170)
e57dd91 Bump sphinx-copybutton from 0.4.0 to 0.5.0 (#174)
a8b1bdc split Docker build into seperate workflow, only run when necessary (#178)
59c91f7 make install comments work on all shells (#179)
a059416 Merge pull request #160 from allenai/GuessStepDirBetter
de7195d more fixes for conda-forge (#177)
75e9d42 use conda in Docker image, multi-stage build (#172)
611e446 Merge pull request #176 from allenai/latest-outputs
7241d20 Merge pull request #175 from allenai/self-contained-tests
83aa692 Merge pull request #153 from allenai/LazyWithoutFromParams
893e601 use virtualenv within Docker (#167)
178b8bd Merge pull request #171 from allenai/LenientRegister
6c765c8 Merge pull request #169 from allenai/InformativeFileLock
91ff7ac Merge pull request #168 from allenai/DefaultGlobalConfig
5d602fb push Docker images to GHCR.io (#166)
2b26fc8 set 'resume' to 'allow' instead of 'auto' (#155)
26771e7 fix bug when git missing (#163)
9009119 Add Dockerfile (#162)
a02155d Add a required flag to the README for gpt2-example (#159)