All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
## Unreleased

- Added the `Workspace.remove_step()` method to safely remove steps (see the sketch after this list).
- `GSWorkspace()` can now be initialized with Google Cloud bucket subfolders.
- The `BeakerExecutor` now uses the HEAD commit at the time the executor is instantiated to execute a step, rather than the HEAD commit at the time the step is run.
- Removed unnecessary code coverage dev requirements.
- Fixed an issue where a new version of torch caused no LR schedulers to be registered.
- Updated pinned versions of jax, jaxlib, and flax.
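A minimal sketch of removing a step from a workspace. The workspace URL is illustrative, and the assumption that `remove_step()` takes a step's unique ID should be checked against the API docs:

```python
from tango import Workspace

# Open an existing local workspace by URL.
workspace = Workspace.from_url("local:///tmp/my-workspace")

# Assumption: remove_step() takes the unique ID of the step to remove.
workspace.remove_step("train-abc123")
```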
## v1.2.1 - 2023-04-06

- Added the following workspace methods to support the Tango viz UI: `Workspace.search_registered_runs()`, `Workspace.search_step_info()`, `Workspace.num_registered_runs()`, and `Workspace.num_steps()` (see the sketch after this list).
- Fixed a bug where `FromParams` would fail to parse when an object takes a `Step` argument directly.
- Changed a name so we don't shadow the built-in name `set`.
- Fixed a bug that would cause O(n^2) memory consumption in dense step graphs.
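A hedged sketch of the new query methods. The exact signatures (for instance, whether the `search_*` methods take filters) are assumptions; consult the API docs:

```python
from tango import Workspace

workspace = Workspace.from_url("local:///tmp/my-workspace")

# Assumption: the num_* methods take no arguments, and the search_*
# methods can be called without filters.
print(workspace.num_registered_runs())
print(workspace.num_steps())
for run in workspace.search_registered_runs():
    print(run)
```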
## v1.2.0 - 2023-02-10

- You can now add arguments to steps without invalidating the cache. See `Step.SKIP_DEFAULT_ARGUMENTS`.
- Fixed integration status messages in the `tango info` command.
- Added abstractions for `RemoteClient`, `RemoteStepCache`, and `RemoteWorkspace`.
- Added a GS integration that comes with `GSWorkspace`, a remote `Workspace` implementation that uses Google Cloud Storage.
- You can now bind functional steps to the underlying `Step` instance with `@step(bind=True)`, meaning the first argument to the function will be a `Step` (see the sketch after this list).
- Added `ShellStep` for running arbitrary shell commands.
- Added the `@make_registrable` decorator to make arbitrary functions registrable, making it easier to refer to them in Tango configurations.
- Jsonnet parsing is now much faster and works on Windows.
- Warnings about locks are now reliably printed every 30 seconds.
- We now make sure Beaker jobs have the latest version of beaker-py, so that we're compatible with the latest API changes.
- Stopping early now works when the metric doesn't change at all.
- Fixed a bug with `FromParams` which didn't handle variable-length tuples correctly.
- The default log level for Tango is now `warning`.
- You can now specify multiple steps with `-s` from the `tango run` command.
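A minimal sketch of a functional step bound to its underlying `Step` instance. This assumes the `step` decorator is importable from the top-level `tango` module and that the bound instance exposes `work_dir`:

```python
from tango import step

@step(bind=True)
def count_lines(self, path: str) -> int:
    # With bind=True, the first argument is the underlying Step instance,
    # so the function body can use things like the step's working directory.
    print(f"working in {self.work_dir}")
    with open(path) as f:
        return sum(1 for _ in f)
```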
## v1.1.0 - 2022-12-01

- Added a `gpu_type` field to `StepResources`. The `BeakerExecutor` can use this to determine which clusters to submit a step to (see the sketch after this list).
- Added a `machine` field to `StepResources`. You can set this to "local" when using the `BeakerExecutor` to force it to run the step locally.
- Added an `--ext-var` argument to `tango run` for setting Jsonnet external variables when loading the experiment config.
- Added the `@step()` decorator to create `Step` classes from functions.
- Added the `transformers::with_soft_prompt` integration, to make soft-prompted prefix transformers easy.
- Removed the PyTorch Lightning integration.
- Removed the `tango server` command and the `--serve/--no-serve` option for `tango run`.
- Removed `source_release.py`, which was checked in by accident.
- Fixed an issue where the Executor `parallelism` option in a Tango settings file would be ignored.
- Fixed a bug where the unique ID of a step that depends on a key-value of the result of another step could change if the name of the other step changes.
- Fixed a bug where importing certain libraries (like torchmetrics) would mess with our exception handling because they set `sys.excepthook` for some reason. Now we always reset `sys.excepthook` after importing.
- The type hints for the flax trainer suggested that the training split is optional when in fact it's mandatory.
- Made `BeakerWorkspace`/`BeakerStepLock` more robust when a job is preempted.
- Minor performance improvements for the Beaker executor and workspace.
- Fixed a bug with `step_extra_dependencies` where uncacheable dependencies wouldn't be run.
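A hedged sketch of the new resource fields. The `StepResources` import path and constructing a step directly with `step_resources` (a parameter introduced in v0.12.0 below) are assumptions:

```python
from tango import Step
from tango.step import StepResources  # assumption: StepResources lives in tango.step

class AddOne(Step):
    def run(self, x: int) -> int:  # type: ignore[override]
        return x + 1

# gpu_type hints the BeakerExecutor toward clusters with matching GPUs;
# machine="local" would instead force the step to run locally.
step = AddOne(x=1, step_resources=StepResources(gpu_count=1, gpu_type="A100"))
```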
## v1.0.2 - 2022-11-14

- `BeakerScheduler` can now return a list of clusters.
## v1.0.1 - 2022-10-20

- `LightningTrainStep` can now take a `Lazy` model object, which results in a guaranteed deterministic hash.
- Fixed an issue where remote `Workspace` implementations like `WandbWorkspace` and `BeakerWorkspace` would use the same local cache regardless of the W&B / Beaker workspace being used.
- Fixed a bug with `TorchEvalStep` when constructing callbacks.
- Fixed some import error issues caused when an integration is not installed.
- Fixed incorrect reporting of final results in `MulticoreExecutor`.
- The W&B step cache now retries API calls in case of a timeout.
- `beaker-py >= 1.11` is now required.
## v1.0.0 - 2022-10-05

- Added a `step_extra_dependencies` input field to the `Step` class that can be used to force a dependency on another step even if the current step doesn't directly depend on the output of the other step (see the sketch after this list). See #418 for more context.
- `beaker-py >= 1.10` is now required.
- Long log lines will be soft-wrapped to ensure that links are clickable.
- Fixed a bug where some workspaces could be left in a bad state if a step's `Format` failed to serialize the step's result in `Workspace.step_finished()`.
- Sometimes functions and methods end up as arguments to steps, which means we have to hash them. Instead of taking a hash of the function, we now take a hash of the function's module and name.
- Fixed a bug with the Beaker executor where it would hang at the end of a run if a failed step was a dependency of another step.
- Fixed tests to work with the new version of transformers.
- Fixed `Executor.execute_sub_graph_for_step()` to be able to run the step's dependencies in parallel.
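A hedged sketch of forcing an execution-order dependency with `step_extra_dependencies`. The assumption that it accepts a list of `Step` objects at construction time is illustrative:

```python
from tango import Step

class Download(Step):
    def run(self) -> str:  # type: ignore[override]
        return "data.txt"

class Train(Step):
    def run(self, data: str) -> str:  # type: ignore[override]
        return f"model({data})"

download = Download()
# Train doesn't consume download's output, but must still run after it.
train = Train(data="other.txt", step_extra_dependencies=[download])
```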
## v0.14.0 - 2022-09-20

- Added a function to modify a Hugging Face transformer with IA3 adaptors.
- Added a `BeakerScheduler` registrable class, specified as the `scheduler` argument to `BeakerExecutor`, which controls the resources assigned to steps run on Beaker. Users can implement their own `BeakerScheduler` subclasses to customize the resource assignment behavior.
- In the `tango run` command, `--no-server` is now the default. Use `--server` to start the server.
- Made `BeakerExecutor` more robust to connection, timeout, SSL, and other recoverable HTTP errors.
- Made the `BeakerStepLock` more robust, and as a result `BeakerWorkspace` is more robust and should require less manual intervention for locks in a bad state.
- Fixed a bug with the internal scheduling logic of the `BeakerExecutor` which could delay submitting some steps in parallel.
- Fixed a bug where creating a `StepInfo` object from params might result in unnecessary imports.
- Fixed a bug where canceling the Beaker executor might not work properly.
- Fixed a bug where the trainer trains too much when `train_epochs` is set and you're using gradient accumulation.
- Fixed a bug where included modules might not be found when using multiprocessing when they're not on `sys.path`/`PYTHONPATH`.
- Fixed how the results of uncacheable steps are displayed by `tango run`.
- The Beaker executor won't run duplicate cacheable steps at the same time.
## v0.13.0 - 2022-09-07

- You can now reference a particular index of the result of another step in a config. For example: `{type: "ref", ref: "some_previous_step", key: 0}`. The `key` field can be an integer if the result of the referenced step is a list or tuple, or a string if the result of the referenced step is a dictionary.
- Added a `priority` parameter to the Beaker executor for setting the default task priority for Beaker jobs.
- Added a `Workspace.step_result()` method for getting a step's result from the latest run (see the sketch after this list).
- `tango run` will now display a URL to the logs for failed steps when you use the `BeakerExecutor`.
- `TorchTrainStep` now enables monitoring arbitrary model outputs during training. `TorchTrainEngine.forward_train` now returns a tuple `loss, model_outputs` for each micro batch, and the list of model outputs for all micro batches in a batch is passed to `TrainCallback.log_batch` and `TrainCallback.post_batch`.
- Tango will now automatically search Python modules in the current working directory for registered classes, so that you don't always need to use the `--include-package` setting.
- The minimum supported Python version is now 3.8.
- Added support for PyTorch Lightning 1.7.x.
- The Beaker executor will no longer live-stream logs from Beaker jobs, but logs will be viewable on Beaker and more readable.
- Only the Beaker executor requires a clean working directory.
- Fixed a bug that did not allow a W&B artifact's type to be set from a step's metadata dictionary.
- Fixed a bug with how the Beaker executor streams log lines from Beaker, which sometimes resulted in messages missing some starting characters and tqdm lines being duplicated.
- Fixed a bug in the Beaker workspace where the lock dataset wouldn't be removed if the step was found to be in an invalid state.
- Improved cluster choice logic in `BeakerExecutor` to ensure greater diversity of clusters when submitting many steps at once.
- Fixed a bug where sub-processes of the multicore executor would use the wrong executor if `executor` was defined in a `tango.yml` file.
- Deterministic hashes for numpy and torch tensors were not deterministic. Now they are.
## v0.12.0 - 2022-08-23

- Step resources:
  - Added a `step_resources` parameter to the `Step` class which should be used to describe the computational resources required to run a step. `Executor` implementations can use this information. For example, if your step needs 2 GPUs, you should set `step_resources=StepResources(gpu_count=2)` (`"step_resources": {"gpu_count": 2}` in the configuration language).
  - Added a `Step.resources()` property method. By default this returns the value specified by the `step_resources` parameter. If your step implementation always requires the same resources, you can just override this method so you don't have to provide the `step_resources` parameter (see the sketch after this list).
- Step execution:
  - Added an `executor` field to the `tango.yml` settings. You can use this to define the executor you want to use by default.
  - Added a Beaker `Executor` to the Beaker integration, registered as an `Executor` with the name "beaker". To use this executor, add these lines to your `tango.yml` file:

    ```yaml
    executor:
      type: beaker
      beaker_workspace: ai2/my-workspace
      clusters:
        - ai2/general-cirrascale
    ```

    See the docs for the `BeakerExecutor` for more information on the input parameters.
- Step class:
  - Added a metadata field to the step class API. This can be set through the class variable `METADATA` or through the constructor argument `step_metadata`.
- Weights & Biases integration:
  - You can now change the artifact kind for step result artifacts by adding a field called "artifact_kind" to a step's metadata. For models, setting "artifact_kind" to "model" will add the corresponding artifact to W&B's new model zoo.
- CLI:
  - The `tango run` command will throw an error if you have uncommitted changes in your repository, unless you use the `--allow-dirty` flag.
  - The `tango run` command will use the lightweight base executor (single process) by default. To use the multi-process executor, set `-j/--parallelism` to 1 or higher, or -1 to use all available CPU cores.
- Fixed a bug where `StepInfo` environment and platform metadata could be out-of-date if a step is run again due to failure.
- Fixed a bug where an unfortunate combination of early stopping and decreasing model performance could result in a crash in the torch trainer.
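A hedged sketch of overriding the `Step.resources` property for a step that always needs the same resources (the `StepResources` import path is an assumption):

```python
from tango import Step
from tango.step import StepResources  # assumption: import path

class BigTrain(Step):
    def run(self) -> str:  # type: ignore[override]
        return "trained"

    @property
    def resources(self) -> StepResources:
        # Always request 2 GPUs, so callers never need to pass step_resources.
        return StepResources(gpu_count=2)
```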
## v0.11.0 - 2022-08-04

- Added a Flax integration along with an example config.
## v0.10.1 - 2022-07-26

- Fixed an issue where the `StepInfo` config argument could be parsed into a `Step`.
- Restored capability to run tests out-of-tree.
## v0.10.0 - 2022-07-07

- Renamed the `workspace` parameter of the `BeakerWorkspace` class to `beaker_workspace`.
- The `Executor` class is now a `Registrable` base class. `MulticoreExecutor` is registered as "multicore".
- Removed `StepExecutionMetadata`. Its fields have been absorbed into `StepInfo`.
- Improved `Step.ensure_result()` such that the step's result doesn't have to be read from the cache.
- Fixed an issue with the output from `MulticoreExecutor` such that it's now consistent with the default `Executor` for steps that were found in the cache.
- One of our error messages referred to a configuration file that no longer exists.
- Improved performance of `BeakerWorkspace`.
- Added the ability to train a `Model` directly instead of just `Lazy[Model]`.
## v0.9.1 - 2022-06-24

- Fixed non-deterministic behavior in `TorchTrainStep`.
- Fixed a bug in `BeakerWorkspace` where `.step_info(step)` would raise a `KeyError` if the step hadn't been registered as part of a run yet.
- Fixed a bug in `BeakerWorkspace` where it would send too many requests to the Beaker service.
- Fixed a bug where `WandbWorkspace.step_finished()` or `.step_failed()` would crash if called from a different process than `.step_starting()`.
- Fixed a bug in `WandbWorkspace.step_finished()` which sometimes led to a `RuntimeError` while caching the result of a step.
## v0.9.0 - 2022-06-01

- Added a Beaker integration that comes with `BeakerWorkspace`, a remote `Workspace` implementation that uses Beaker Datasets under the hood.
- Added a `datasets::dataset_remix` step that provides the split remixing functionality of `tango.steps.dataset_remix.DatasetRemixStep`, now for Hugging Face `DatasetDict`.
- Added a config and code example of `Registrable` to the First Step docs, with edits for clarity.
- If you try to import something from a Tango integration that is not fully installed due to missing dependencies, an `IntegrationMissingError` will be raised instead of `ModuleNotFound`.
- You can now set `-j 0` in `tango run` to disable multicore execution altogether.
- Improved how steps and workspaces handle race conditions when different processes are competing to execute the same step. This would result in a `RuntimeError` before with most workspaces, but now it's handled gracefully.
- Fixed a bug which caused GradScaler state to not be saved and loaded with checkpoints.
## v0.8.0 - 2022-05-19

- Added a Weights & Biases remote `Workspace` implementation: `WandbWorkspace`, registered as "wandb". This can be instantiated from a workspace URL in the form "wandb://entity/project".
- Added a method `Workspace.step_result_for_run()` which gives the result of a step given the run name and step name within that run.
- Added a property `Workspace.url`, which returns a URL for the workspace that can be used to instantiate the exact same workspace using `Workspace.from_url()` (see the sketch after this list). Subclasses must implement this.
- `StepInfo` start and end times will always be in UTC now.
- `WandbTrainCallback` now logs system metrics from each worker process in distributed training.
- `StepCache.__contains__()` and `StepCache.__getitem__()` now accept either a `Step` or a `StepInfo` as an argument (`Union[Step, StepInfo]`).
- Refactored `tango.step_graph.StepGraph` to allow initialization from a `Dict[str, Step]`.
- `Executor.execute_step_graph()` now attempts to execute all steps and summarizes success/failures.
- Fixed a bug with `LocalWorkspace.from_parsed_url()` (#278).
- Deprecation warnings will now be logged from the `tango` CLI.
- Fixed the text format in the case of serializing an iterator of strings.
- Added a missing default value of `None` to `TangoGlobalSettings.find_or_default()`.
- Mypy has become incompatible with transformers and datasets, so we have to disable the checks in some places.
- The `VERSION` member of step arguments that were wrapped in `Lazy` was not respected. Now it is.
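A minimal round-trip sketch of `Workspace.url` and `Workspace.from_url()`. The W&B entity/project below is hypothetical, and the wandb integration must be installed:

```python
from tango import Workspace

original = Workspace.from_url("wandb://my-entity/my-project")

# Every workspace can be reconstructed from its own URL.
clone = Workspace.from_url(original.url)
assert clone.url == original.url
```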
## v0.7.0 - 2022-04-19

- Added the "-n/--name" option to `tango run`. This option allows the user to give the run an arbitrary name.
- Added a convenience property `.workspace` to the `Step` class that can be called from a step's `.run()` method to get the current `Workspace` being used.
- Gave `FromParams` objects (which includes all `Registrable` objects) the ability to version themselves (see the sketch after this list).
- Added a CLI option to run a single step in a config using `--step-name` or `-s`.
- Added a `MultiCoreExecutor` that executes steps in parallel.
- Added an `ExecutorOutput` dataclass that is returned by `Executor.execute_step_graph()`.
- `StepGraph` now prints itself in a readable way.
- Tango now automatically detects when it's running under a debugger and disables multicore support accordingly. Many debuggers can't properly follow sub-processes, so this is a convenience for people who love debuggers.
- Added more models to the stuff we can import from the transformers library.
- Added a new example for fine-tuning text-to-text models.
- Renamed `click_logger` to `cli_logger`, and we now use rich's logging `Handler` as the default handler, which means prettier output, better tracebacks, and you can use rich's markup syntax with the `cli_logger` to easily add style to text.
- Refactored `tango.step_graph.StepGraph` to allow initialization from a `Dict[str, Step]`.
- `Executor.execute_step_graph()` now attempts to execute all steps and summarizes success/failures.
- Upgraded the PyTorch version in the `tango` Docker image to the latest, `v1.11.0+cu113`.
- `RunGeneration` now allows a model object as input.
- Fixed a bug that mistakenly disallowed fully-qualified names containing `"_"` (underscores) in the config.
- Fixed a bug where the `TorchTrainStep` working directory would be left in an unrecoverable state if training failed after saving the final model weights.
- Fixed a bug in `FromParams` where `**kwargs` might be passed down to the constructors of arguments.
- Fixed a bug in the way dependencies are tracked between steps.
- Fixed a bug that caused `MulticoreExecutor` to hang in case of a failing step that was required recursively (not directly) downstream.
- Compatibility with PyTorch Lightning 1.6.
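A minimal sketch of versioning a step via the `VERSION` class member (also referenced under v0.8.0 above):

```python
from typing import List

from tango import Step

class Tokenize(Step):
    # Bump VERSION whenever run()'s logic changes, so previously
    # cached results are invalidated.
    VERSION = "002"

    def run(self, text: str) -> List[str]:  # type: ignore[override]
        return text.split()
```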
## v0.6.0 - 2022-02-25

- New example that fine-tunes a pre-trained ResNet model on the Cats & Dogs dataset.
- Added a `@requires_gpus` decorator for marking tests as needing GPUs. Tests marked with this will be run in the "GPU Tests" workflow on dual k80 GPUs via Beaker.
- Added the "-w/--workspace" option to the `tango run` and `tango server` commands. This option takes a path or URL and instantiates the workspace from the URL using the newly added `Workspace.from_url()` method.
- Added the "workspace" field to `TangoGlobalSettings`.
- Added the "environment" field to `TangoGlobalSettings` for setting environment variables each time `tango` is run.
- Added a utility function to get a `StepGraph` directly from a file.
- Added the `tango.settings` module and the `tango settings` group of commands.
- Added a format for storing sequences as `SqliteSparseSequence`.
- Added a way to massage kwargs before they determine the unique ID of a `Step`.
- `local_workspace.ExecutorMetadata` renamed to `StepExecutionMetadata` and now saved as `execution-metadata.json`.
- `tango run` without the option "-w/--workspace" or "-d/--workspace-dir" will now use a `MemoryWorkspace` instead of a `LocalWorkspace` in a temp directory, unless you've specified a default workspace in a `TangoGlobalSettings` file.
- Moved `tango.workspace.MemoryWorkspace` and `tango.local_workspace.LocalWorkspace` to `tango.workspaces.*`.
- Moved `tango.step_cache.MemoryStepCache` and `tango.step_cache.LocalStepCache` to `tango.step_caches.*`.
- Deprecated the `-d/--workspace-dir` command-line option. Please use `-w/--workspace` instead.
- Fixed a small bug where `LocalWorkspace` would fail to capture the conda environment in our Docker image.
- Fixed activation of `FILE_FRIENDLY_LOGGING` when set from the corresponding environment variable.
- Fixed setting the log level via the environment variable `TANGO_LOG_LEVEL`.
- Use relative paths within the `work_dir` for symbolic links to the latest and best checkpoints in `TorchTrainStep`.
- Fixed some scenarios where Tango can hang after finishing all steps.
- `distributed_port` and `log_every` parameters won't factor into `TorchTrainStep`'s unique ID.
- `MappedSequence` now works with slicing.
- `MappedSequence` now works with Hugging Face `Dataset`.
- Uncacheable steps are now visible in the Tango UI.
- Fixed a bug in `Registrable.list_available()` where an error might be raised if the default implementation hadn't been explicitly imported.
- Fixed an issue where having a default argument to the `run()` method wasn't getting applied to the step's unique ID.
## v0.5.0 - 2022-02-09

- Added a `TrainingEngine` abstraction to the torch integration.
- Added FairScale support with a `FairScaleTrainingEngine` that leverages FairScale's `FullyShardedDataParallel`. This is meant to be used within the `TorchTrainStep`.
- All PyTorch components (such as learning rate schedulers, optimizers, data collators, etc.) from the transformers library are now registered under the corresponding class in the torch integration. For example, the transformers `Adafactor` optimizer is registered as an `Optimizer` under the name "transformers::Adafactor". More details can be found in the documentation for the transformers integration.
- Various changes to the parameters of the `TorchTrainStep` due to the introduction of the `TrainingEngine` class.
- Params are now logged at the `DEBUG` level instead of `INFO` to reduce noise in logs.
- The waiting message for `FileLock` is now clear about which file it's waiting for.
- Added an easier way to get the default Tango global config.
- Most methods of `TorchTrainCallback` also take an `epoch` parameter now.
- `WandbTrainCallback` now logs peak GPU memory occupied by PyTorch tensors per worker. This is useful because W&B's system metrics only display the total GPU memory reserved by PyTorch, which is always higher than the actual amount of GPU memory occupied by tensors. So these new metrics give a more accurate view into how much memory your training job is actually using.
- Plain old Python functions can now be used in `Lazy` objects.
- `LocalWorkspace` now creates a symlink to the outputs of the latest run.
- Tango is now better at guessing when a step has died and should be re-run.
- Tango is now more lenient about registering the same class under the same name twice.
- When you use `dict` instead of `Dict` in your type annotations, you now get a legible error message. The same goes for `List`, `Tuple`, and `Set`.
- Fixed a bug in `Registrable` and `FromParams` where registered function constructors would not properly construct arguments that were classes.
- Fixed a bug in `FromParams` that would cause a crash when an argument to the constructor had the name `params`.
- Made `FromParams` more efficient by only trying to parse the params as a `Step` when it looks like it actually could be a step.
- Fixed a bug where `Executor` would crash if the `git` command could not be found.
- Fixed a bug where validation settings were not interpreted the right way by the torch trainer.
- When you register the same name twice using `Registrable`, you get an error message. That error message now contains the correct class name.
## v0.4.0 - 2022-01-27

- The default log level is `WARNING` instead of `ERROR`.
- The web UI now renders the step graph left-to-right.
- The web UI now shows runs by date, with the most recent run at the top.
- The web UI now shows steps in a color-coded way.
- The `tango run` command now prints user-friendly paths if possible.
- The `--include-package` flag now also accepts paths instead of module names.
- `tango.common.sqlite_sparse_sequence.SqliteSparseSequence` now lives at `tango.common.sequences.SqliteSparseSequence`.
- Ensured tqdm log lines always make it into the log file `out.log`, even when the log level is `WARNING` or `ERROR`.
- Numerous parts of Tango now have documentation when they didn't before.
## v0.4.0rc5 - 2022-01-19

- Added `TorchEvalStep` to the torch integration, registered as "torch::eval".
- Renamed `aggregate_val_metric` to `auto_aggregate_val_metric` in `TorchTrainStep`.
- The `devices` parameter to `TorchTrainStep` was replaced with `device_count: int`.
- The run name is now printed at the end of a run so it's easier to find.
- Type information added to package data. See PEP 561 for more information.
- A new integration, `transformers`, with two new steps for running seq2seq models.
- Added `logging_tqdm`, if you don't want a progress bar but you still want to see progress in the logs.
- Added `threaded_generator()`, for wrapping generators so that they run in a separate thread from the generator's consumer.
- Added a new example for evaluating the T0 model on XSum, a summarization task.
- Added `MappedSequence` for functionally wrapping sequences.
- Added `TextFormat`, in case you want to store the output of your steps in raw text instead of JSON.
- Steps can now list arguments in `SKIP_ID_ARGUMENTS` to indicate that the argument should not affect a step's unique ID (see the sketch after this list). This is useful for arguments that affect the execution of a step, but not the output.
- `Step` now implements `__str__`, so steps look pretty in the debugger.
- Added `DatasetCombineStep`, a step that combines multiple datasets into one.
- Added the `common.logging.initialize_worker_logging()` function for configuring logging from worker processes/threads.
- Logs from `tango run ...` will be written to a file called `out.log` in the run directory.
- Fixed torch `StopEarlyCallback` state not being recovered properly on restarts.
- Fixed file-friendly logging by removing special styling characters.
- Ensured exceptions are captured in logs.
- `LocalWorkspace` now works properly with uncacheable steps.
- When a Tango run got killed hard, with `kill -9`, or because the machine lost power, `LocalWorkspace` would sometimes keep a step marked as "running", preventing further executions. This still happens sometimes, but it is now much less likely (and Tango gives you instructions for how to fix it).
- To make all this happen, `LocalWorkspace` now saves step info in a Sqlite database. Unfortunately that means the workspace format changes, and existing workspace directories won't work properly with it.
- Fixed premature cleanup of temporary directories when using `MemoryWorkspace`.
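A minimal sketch of `SKIP_ID_ARGUMENTS`. The assumption that it is a set of argument names on the step class is illustrative:

```python
from tango import Step

class Train(Step):
    # log_every changes how the step reports progress, not what it
    # produces, so it shouldn't affect the unique ID (and cache key).
    SKIP_ID_ARGUMENTS = {"log_every"}

    def run(self, data: str, log_every: int = 100) -> str:  # type: ignore[override]
        return f"model({data})"
```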
## v0.4.0rc4 - 2021-12-20

- Fixed a bug where `StepInfo` fails to deserialize when `error` is an exception that can't be pickled.
## v0.4.0rc3 - 2021-12-15

- Added the `DatasetsFormat` format and `LoadStreamingDataset` step to the `datasets` integration.
- `SqliteDictFormat` for datasets.
- Added `pre_epoch()` and `post_epoch()` callback methods to the PyTorch `TrainCallback`.
- The `LoadDataset` step from the `datasets` integration is now cacheable, using the `DatasetsFormat` format by default. But this only works with non-streaming datasets. For streaming datasets, you should use the `LoadStreamingDataset` step instead.
- Fixed a bug where `KeyboardInterrupt` exceptions were not handled properly by steps and workspaces.
- `WandbTrainCallback` now will use part of the step's unique ID as the name for the W&B run by default, to make it easier to identify which Tango step corresponds to each run in W&B.
- `WandbTrainCallback` will save the entire `TrainConfig` object to the W&B config.
## v0.4.0rc2 - 2021-12-13

- Sample experiment configurations that prove Euler's identity.
- Loosened the `Click` dependency to include v7.0.
- Loosened the `datasets` dependency.
- Tightened the `petname` dependency to exclude the next major release, for safety.
- `Workspace`, `MemoryWorkspace`, and `LocalWorkspace` can now be imported directly from the `tango` base module.
- Uncacheable leaf steps would never get executed. This is now fixed.
- We were treating failed steps as if they were completed by accident.
- The visualization had a problem with showing steps that never executed because a dependency failed.
- Fixed a bug where `Lazy` inputs to a `Step` would fail to resolve arguments that come from the result of another step.
- Fixed a bug in `TorchTrainStep` where some arguments for distributed training (`devices`, `distributed_port`) weren't being set properly.
## v0.4.0rc1 - 2021-11-30

- Introduced the concept of the `Workspace`, with `LocalWorkspace` and `MemoryWorkspace` as initial implementations.
- Added a stub of a webserver that will be able to visualize runs as they happen.
- Added separate classes for `LightningTrainingTypePlugin`, `LightningPrecisionPlugin`, `LightningClusterEnvironmentPlugin`, and `LightningCheckpointPlugin` for compatibility with `pytorch-lightning>=1.5.0`.
- Added a visualization of workspaces that can show step graphs while they're executing.
- Removed the old `LightningPlugin` class.
- Removed the requirement of the `overrides` package.
- Made it possible to construct a step graph out of `Step` objects, instead of constructing it out of `StepStub` objects.
- Removed dataset fingerprinting code, since we can now use `Step` to make sure things are cached.
- Made steps deterministic by default.
- Brought back `MemoryStepCache`, so we can run steps without configuring anything.
- The W&B `torch::TrainCallback` logs with `step=step+1` now, so that training curves in the W&B dashboard match up with checkpoints saved locally and are easier to read (e.g. step 10000 instead of 9999).
- `filelock >= 3.4` is now required, and the parameter `poll_intervall` of `tango.common.file_lock.FileLock.acquire` was renamed to `poll_interval`.
- Fixed a bug in `FromParams` where a parameter to a `FromParams` class may not be instantiated correctly if it's a class with a generic type parameter.
## v0.3.6 - 2021-11-12

- Added a `.log_batch()` method on `torch::TrainCallback`, which is given the average loss across distributed workers, but only called every `log_every` steps.
- Removed the `.pre_log_batch()` method on `torch::TrainCallback`.
- Fixed a typo in the parameter name `remove_stale_checkpoints` in `TorchTrainStep` (previously was `remove_state_checkpoints`).
- Fixed a bug in `FromParams` that would cause failures when `from __future__ import annotations` was used with Python older than 3.10. See PEP 563 for details.
## v0.3.5 - 2021-11-05

- Fixed a bug in `FromParams` where the "type" parameter was ignored in some cases where the `Registrable` base class did not directly inherit from `Registrable`.
## v0.3.4 - 2021-11-04

- Added `StopEarlyCallback`, a `torch::TrainCallback` for early stopping.
- Added the parameter `remove_stale_checkpoints` to `TorchTrainStep`.
- Minor changes to the `torch::TrainCallback` interface.
- The Weights & Biases `torch::TrainCallback` now logs the best validation metric score.
## v0.3.3 - 2021-11-04

- Added support for PEP 604 in `FromParams`, i.e. writing union types as "X | Y" instead of "Union[X, Y]".
- [internals] Added a spot for miscellaneous end-to-end integration tests (not to be confused with "tests of integrations") in `tests/end_to_end/`.
- [internals] Core tests now run on all officially supported Python versions.
- Fixed a bug in `FromParams` where non-`FromParams` class parameters were not instantiated properly (or at all).
- Fixed a bug in `FromParams` where kwargs were not passed on from a wrapper class to the wrapped class.
- Fixed a small bug where some errors from git would be printed when executor metadata is created outside of a git repository.
## v0.3.2 - 2021-11-01

- Fixed a bug with `FromParams` that caused `.from_params()` to fail when the params contained an object that was already instantiated.
- The `tango` command no longer installs a SIGTERM handler, which fixes some bugs with integrations that use multiprocessing.
## v0.3.1 - 2021-10-29

- Updated the `LightningTrainStep` to optionally take in a `LightningDataModule` as input.
## v0.3.0 - 2021-10-28

- Added `IterableDatasetDict`, a version of `DatasetDict` for streaming-like datasets.
- Added a PyTorch Lightning integration with `LightningTrainStep`.
- Fixed a bug with `FromParams` and `Lazy` where extra arguments would sometimes be passed down through to a `Lazy` class when they shouldn't.
## v0.2.4 - 2021-10-22

- Added support for torch 1.10.0.
- The `--file-friendly-logging` flag is now an option to the main `tango` command, so it needs to be passed before `run`, e.g. `tango --file-friendly-logging run ...`.
- Fixed a bug with `Step.from_params`.
- Ensured logging is initialized in spawned processes during distributed training with `TorchTrainStep`.
## v0.2.3 - 2021-10-21

- Added support for a global settings file, `tango.yml`.
- Added an 'include_package' (array of strings) param to the config spec.
- Added a custom error `StopEarly` that a `TrainCallback` can raise within the `TorchTrainStep` to stop training early without crashing.
- Added the step config, Tango command, and Tango version to the executor metadata.
- The executor now also saves pip dependencies and conda environment files to the run directory for each step.
- Ensured `**kwargs` arguments are logged in `FromParams`.
## v0.2.2 - 2021-10-19

- Added new steps to the `datasets` integration: `ConcatenateDatasets` ("datasets::concatenate") and `InterleaveDatasets` ("datasets::interleave").
- Added `__contains__` and `__iter__` methods on `DatasetDict`, so that it is now a `Mapping` class.
- Added the `tango info` command that, among other things, displays which integrations are installed.
## v0.2.1 - 2021-10-18

- Added the `convert_to_tango_dataset_dict()` function in the `datasets` integration. It's important for step-caching purposes to use this to convert a HF `DatasetDict` to a native Tango `DatasetDict` when that `DatasetDict` is part of the input to another step (see the sketch after this list). Otherwise the HF `DatasetDict` will have to be pickled to determine its hash.
- `Format.checksum()` is now an abstract method. Subclasses should only compute the checksum on the serialized artifact and nothing else in the directory.
- [internals] Changed the relationship between `Executor`, `StepCache`, and `Step`. `Executor` now owns the `StepCache`, and `Step` never interacts with `StepCache` directly.
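A hedged sketch of the conversion. The import path is an assumption, and the Hugging Face `datasets` library plus the Tango datasets integration must be installed:

```python
import datasets  # Hugging Face datasets

# Assumption: the function is exposed from the integration's module.
from tango.integrations.datasets import convert_to_tango_dataset_dict

hf_dataset_dict = datasets.load_dataset("glue", "mrpc")
# Converting to a native Tango DatasetDict avoids pickling the whole
# HF DatasetDict just to compute its hash as a step input.
tango_dataset_dict = convert_to_tango_dataset_dict(hf_dataset_dict)
```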
## v0.2.0 - 2021-10-15

- Added a Weights & Biases integration with a training callback ("wandb::log") for `TorchTrainStep` ("torch::train") that logs training and validation metrics to W&B.
- Fixed `Format.checksum()` when there is a symlink to a directory in the cache folder.
## v0.1.3 - 2021-10-15

- Added the ability to track a metric other than "loss" for validation in `TorchTrainStep` ("torch::train").
- The final model returned from `TorchTrainStep` ("torch::train") will have the best weights loaded.
- Checkpoints are saved from `TorchTrainStep` ("torch::train") even when there is no validation loop.
- Fixed `TorchTrainStep` ("torch::train") when `validation_split` is `None`.
- Fixed distributed training with `TorchTrainStep` ("torch::train") on GPU devices.
## v0.1.2 - 2021-10-13

- Added support for YAML configuration files.
## v0.1.1 - 2021-10-12

- `TorchTrainStep` now displays a progress bar while saving a checkpoint to file.
- The default executor now saves an "executor-metadata.json" file to the directory for each step.
- Renamed `DirectoryStepCache` to `LocalStepCache` (registered as "local").
- `LocalStepCache` saves metadata to `cache-metadata.json` instead of `metadata.json`.
- Fixed a bug with `TorchTrainStep` during distributed training.
- `FromParams` will now automatically convert strings into `Path` types when the annotation is `Path`.
## v0.1.0 - 2021-10-11

- Added `StepGraph` and `Executor` abstractions.
- Added a basic PyTorch training step registered as `"torch::train"`, along with other registrable components, such as `Model`, `DataLoader`, `Sampler`, `DataCollator`, `Optimizer`, and `LRScheduler`.
- Added `DatasetRemixStep` in `tango.steps`.
- Added the module `tango.common.sequences`.
- Added the `DatasetDict` class in `tango.common.dataset_dict`.
- Added 🤗 Datasets integration.
- Added command-line options to set the log level or disable logging completely.
- `Step.work_dir`, `Step.unique_id`, `Step.dependencies`, and `Step.recursive_dependencies` are now properties instead of methods.
- The `tango run` command will acquire a lock on the directory to avoid race conditions.
- Integrations can now be installed with `pip install tango[INTEGRATION_NAME]`. For example, `pip install tango[torch]`.
- Added the method `Registrable.search_modules()` for automatically finding and importing the modules where a given `name` might be registered.
- `FromParams.from_params()` and `Registrable.resolve_class_name` will now call `Registrable.search_modules()` to automatically import modules where the type might be defined. Thus, for classes that are defined and registered within any `tango.*` submodules, it is not necessary to explicitly import them.
- `Step` implementations can now take arbitrary `**kwargs` in their `run()` methods.
## v0.0.3 - 2021-09-27

- Added the `tango` command.
## v0.0.2 - 2021-09-27

- Ported over core Tango components from AllenNLP.
## v0.0.1 - 2021-09-22

- Added initial project boilerplate.