In software engineering, decreasing cycle time has a super-linear effect on progress. In modern deep learning, cycle time is often on the order of hours or days. The easiest way to speed up training is data parallelism: distribute copies of the model across GPUs and machines, and have each copy compute the loss on a shard of the training data. The gradients from these losses can then be accumulated (typically averaged) across the copies and applied to every replica, so the weights stay in sync while the fleet processes far more data per step than a single machine could.
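As a rough illustration of that loop, here is a minimal sketch using PyTorch's `torch.distributed`. It assumes the script is launched with `torchrun`, and the model, dataset, and hyperparameters are placeholders rather than anything from a real training setup; the point is only the shard-compute-average-update pattern.

```python
# Minimal data-parallel training sketch with torch.distributed.
# Launch with: torchrun --nproc_per_node=N this_script.py
import torch
import torch.distributed as dist
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU machines
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Every process holds an identical copy of the model.
    torch.manual_seed(0)
    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # DistributedSampler hands each process a disjoint shard of the data.
    data = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    sampler = DistributedSampler(data, num_replicas=world_size, rank=rank)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    loss_fn = nn.MSELoss()
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()

        # Average gradients across all copies so every replica applies
        # the same update and the weights remain identical.
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice you would rarely write the all-reduce loop by hand; wrappers such as `torch.nn.parallel.DistributedDataParallel` perform the same gradient averaging, overlapped with the backward pass.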