Optimize PureNcclCommunicator to accelerate training with double buffering #216
Conversation
Let me check my understanding by summarizing the changes:
- The thread that asynchronously ran `allreduce_grad` in the optimizer has been removed, replaced by the communicator's new interface `allreduce_grad_async()`, which simply runs the allreduce on the stream that was already being used for double buffering.
- The direct division of all grads right after the `allreduce` operation has been replaced by a `cupy.ElementwiseKernel` implementation.
- The float width casting (model to allreduce and allreduce to model, respectively) has also been replaced with a `cupy.ElementwiseKernel` implementation.
Which of these was dominantly effective in your view? And how much did you gain from all these optimizations, if you have any numbers?
I think the first item is not that effective and contributes mainly to code simplification, while the latter two contribute more to performance. I also think the latter two benefit non-double-buffered execution; is this correct?
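To make the last two items concrete, here is a minimal sketch (not the PR's actual code; the kernel and helper names are hypothetical) of how a `cupy.ElementwiseKernel` can fuse the dtype cast with the division by the communicator size:

```python
import cupy as cp

# Hypothetical sketch: one elementwise kernel that casts gradients to the FP32
# allreduce dtype and divides by the communicator size in a single pass,
# instead of a separate astype() followed by an in-place division.
_cast_and_scale = cp.ElementwiseKernel(
    'T src, float32 inv_size',      # T: model gradient dtype (e.g. float16)
    'float32 dst',                  # allreduce dtype
    'dst = (float)src * inv_size',  # cast and divide in one kernel launch
    'cast_and_scale')


def grads_to_allreduce_mean(packed_grads, comm_size):
    """Cast packed grads to FP32 and divide by the number of workers."""
    return _cast_and_scale(packed_grads, cp.float32(1.0 / comm_size))
```

The reverse cast (allreduce dtype back to the model dtype) can be written with the same pattern.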
@@ -144,7 +184,7 @@ def _get_nccl_type_id(dtype):
     elif dtype == np.float32:
         return nccl.NCCL_FLOAT32
     elif dtype == np.float64:
-        return nccl.NCCL_FLOAT64
+        return nccl.NCC_FLOAT64
Is this intended? Otherwise please fix this typo.
-            n_elems)
-        if stream != chainer.cuda.Stream.null:
+        needs_sync = self._assign(grad_dtype, allreduce_grad_dtype, n_elems)
+        if stream != chainer.cuda.Stream.null and needs_sync:
             chainer.cuda.Stream.null.synchronize()
Why is synchronization required when double buffering is not enabled? I thought everything happens in the null stream in that case... I would like to know why, and it would be nice if the reason were documented here in a comment.
This synchronize() runs only when double buffering is enabled and CUDA memory has just been allocated, so it is not required when double buffering is disabled.
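For illustration, here is a minimal standalone example (plain CuPy on a single GPU, no chainermn) of why the null stream must be synchronized before a freshly allocated buffer is used on a non-blocking side stream such as the double-buffering stream:

```python
import cupy as cp

side = cp.cuda.Stream(non_blocking=True)  # stands in for the double-buffering stream

# Allocation and zero-fill of a new buffer are issued on the null stream.
buf = cp.zeros(1 << 20, dtype=cp.float32)

# A non-blocking stream does not implicitly wait for the null stream, so wait
# here before the side stream touches the newly allocated memory.
cp.cuda.Stream.null.synchronize()

with side:
    buf += 1.0  # safe: the zero-fill is guaranteed to have completed
```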
         self.nccl_comm.allReduce(self.gpu_allreduce_buffer_a.ptr(),
                                  self.gpu_allreduce_buffer_b.ptr(), n_elems,
                                  _get_nccl_type_id(allreduce_grad_dtype),
                                  nccl.NCCL_SUM, stream.ptr)
-        if stream != chainer.cuda.Stream.null:
-            stream.synchronize()
Ditto: why can this be removed?
Because I added `synchronize()` at the end of `allreduce_grad` in `_DoubleBufferingOptimizer`.
https://github.com/chainer/chainermn/pull/216/files#diff-5ac36de863ea63673cdc464693ae1accR131
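In other words, the single wait has moved from the communicator into the optimizer. A rough sketch of the idea (hypothetical class and attribute names, not the PR's actual code):

```python
class _DoubleBufferingOptimizerSketch(object):
    """Illustration only: the optimizer waits once after issuing the
    asynchronous allreduce, so the communicator no longer needs to
    synchronize the stream after every allReduce call."""

    def __init__(self, communicator, stream):
        self.communicator = communicator
        self.stream = stream  # the stream used for double buffering

    def allreduce_grad(self, model):
        # issue the NCCL allreduce asynchronously on the double-buffering stream
        self.communicator._allreduce_grad_async(model, self.stream)
        # single synchronization point, moved here from the communicator
        self.stream.synchronize()
```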
chainermn/optimizers.py (Outdated)
@@ -62,14 +61,10 @@ def __init__(self, actual_optimizer, communicator):
             'needs_update', False)
         super(_DoubleBufferingOptimizer, self).__setattr__(
             'device', None)
As I don't think this is used anymore, can it be removed?
Thanks. I will remove it.
    def allreduce_grad_async(self, model, stream):
This clean-up is great, but since this method is only for `_DoubleBufferingOptimizer`, it should be a private method, e.g. named `def _allreduce_grad_async(...)`.
Ok. I will rename it.
Thank you for the comments.
These are correct!
After an offline discussion, I think I understand the situation. After this pull request, a rough timeline would look like this:
This is done with just a single thread and has far fewer synchronization points: only just before the allreduce and right after the division by size that computes the model mean. The previous implementation had several more synchronization points; in particular, the data transfer inside the GPU from the model to the allreduce buffer (called D2D above) conflicted with the host-to-device transfer of training data before forward (IIRC), or with optimize, forward, or backward. In the best case that transfer starts right after the swap, but in the worst case it starts only after backward finishes, in which case the allreduce begins after backward has completed and there is very little overlap of computation and communication. Also, since the division is moved to the background stream, the amount of computation in the null stream is reduced somewhat. Thank you for the great work!
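A minimal runnable sketch of the stream-overlap pattern described above (plain CuPy on a single GPU; `cp.tanh` stands in for forward/backward on the null stream, and the in-place scaling stands in for allreduce plus division on the background stream; none of this is the PR's actual code):

```python
import cupy as cp

side = cp.cuda.Stream(non_blocking=True)  # background stream for communication
size = 4                                  # stands in for the communicator size
grads_prev = cp.random.rand(1 << 22, dtype=cp.float32)  # previous step's grads
acts_curr = cp.random.rand(1 << 22, dtype=cp.float32)   # current step's work

with side:
    # background stream: "allreduce + division by size" of the previous step's
    # gradients, overlapping with the null-stream compute below
    grads_prev *= 1.0 / size

# null stream: stands in for forward/backward of the current step
acts_curr = cp.tanh(acts_curr)

side.synchronize()  # single synchronization point before the buffers are swapped
```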
I optimized `PureNcclCommunicator` to accelerate training with double buffering.