Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use nccl in CuPy #181

Merged
merged 6 commits into from
Jan 12, 2018
Merged

Use nccl in CuPy #181

merged 6 commits into from
Jan 12, 2018

Conversation

shu65
Copy link
Member

@shu65 shu65 commented Dec 27, 2017

Some errors may occur when NCCL versions used for ChainerMN and CuPy are different. So I modify ChainerMN to use nccl in Cupy and remove NCCL wrapper in ChainerMN.

@shu65 shu65 changed the title Use nccl in CuPy [WIP] Use nccl in CuPy Dec 27, 2017
@shu65 shu65 changed the title [WIP] Use nccl in CuPy Use nccl in CuPy Dec 28, 2017
with the ``--no-nccl`` flag.::

$ python setup.py install --no-nccl
typically for testing for debugging purpose, ChainerMN can be used by not installing CuPy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whole sentence would be like this:

Users who want to try ChainerMN in CPU-only environment may skip installation of CuPy. Non-GPU set up may not be performant as GPU-enabled set up, but would would be useful for testing or debugging training program in non-GPU environment such as laptops or CI jobs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment. I will fix it.


.. note::

To try current version of ChainerMN in a CPU-only enviroment, you not require ``--no-nccl`` flag,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current version of ChainerMN does not need --no-nccl flag for CPU-only environment at installation any more. It would be just ignored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix it.

@kuenishi
Copy link
Member

kuenishi commented Dec 28, 2017

I think CuPy must be properly installed with NCCL enabled. I'd recommend adding a sentence how to check CuPy installation, before ChainerMN installation like python -c 'from cupy import nccl' . Probably add after --no-nccl note?

@kuenishi
Copy link
Member

Also we'd need update on a nccl check at _base.py :

        if use_nccl and not nccl._available:
            raise RuntimeError(
                'NCCL is not available. '
                'Please confirm that NCCL can be found by dynamic linkers, '
                'and ChainerMN is installed without --no-nccl flag.'
            )

@kuenishi kuenishi self-requested a review January 10, 2018 09:17
Copy link
Member

@kuenishi kuenishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comments are mainly on English, while the code seems nice.

@kuenishi
Copy link
Member

kuenishi commented Jan 11, 2018

Changes are good; while I tested again and unfortunately found potential fix for non-nccl checking:

>>> import chainermn
>>> c = chainermn.create_communicator('pure_nccl')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kuenishi/src/chainermn/chainermn/communicators/__init__.py", line 77, in create_communicator
    return PureNcclCommunicator(mpi_comm=mpi_comm)
  File "/home/kuenishi/src/chainermn/chainermn/communicators/pure_nccl_communicator.py", line 12, in __init__
    if nccl.get_version() < 2000:
AttributeError: module 'chainermn.nccl' has no attribute 'get_version'
>>> 

This is because of nccl version checking nccl.get_version coming before initializing CommunicatorBase initialization which includes nccl check. If we keep supporting NCCL 1.x support in other communicators than PureNcclCommunicator I'd recommend putting version check later than CommunicatorBase init. Otherwise moving version check to CommunicatorBase init would make sense more.

@kuenishi kuenishi added this to the 1.2.0 milestone Jan 11, 2018
@shu65
Copy link
Member Author

shu65 commented Jan 11, 2018

Thank you for testing. I confirmed this error.
Because the first idea seems to be better, I will adopt it.

@kuenishi kuenishi merged commit 631a0c4 into master Jan 12, 2018
@kuenishi kuenishi deleted the use_nccl_in_cupy branch January 12, 2018 04:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants