Chainer: A flexible framework for neural networks Chainer is a powerful, flexible, and intuitive framework for neural networks. https://chainer.org/ Wed, 29 Jun 2022 05:51:08 +0000 Wed, 29 Jun 2022 05:51:08 +0000 Jekyll v3.9.2 Chainer/CuPy v7 release and Future of Chainer <p>Today, we would like to announce two things: the release of Chainer/CuPy v7 and the shift of development efforts for Chainer.</p> <h2 id="chainercupy-v7">Chainer/CuPy v7</h2> <p>We have released Chainer and CuPy v7.0.0. Changes can be found in the release notes of pre/releases. Here are some notable updates.</p> <p>Chainer v7 (<a href="https://github.com/chainer/chainer/releases/tag/v7.0.0a1">alpha</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b1">beta1</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b2">beta2</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b3">beta3</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b4">beta4</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0rc1">rc1</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0">major</a>):</p> <ul> <li>Most features of Chainer, including ChainerMN, are now compatible with ChainerX ndarray.</li> <li>ONNX-Chainer is integrated into Chainer.</li> <li><code class="language-plaintext highlighter-rouge">TabularDataset</code> is added. It is a rich abstraction of columnar datasets with pandas like manipulations.</li> <li>NHWC support added. Performance for convolutions and batch normalization greatly improved on GPUs with Tensor Core.</li> </ul> <p>CuPy v7 (<a href="https://github.com/cupy/cupy/releases/tag/v7.0.0a1">alpha</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b1">beta1</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b2">beta2</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b3">beta3</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b4">beta4</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0rc1">rc1</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0">major</a>):</p> <ul> <li>Support NVIDIA cuTENSOR and CUB for better performance.</li> <li>Experimental support of ROCm. CuPy now runs on AMD GPUs.</li> </ul> <p>Also note that Python 2 support is dropped as <a href="https://chainer.org/announcement/2019/08/21/python2.html">announced</a>. Chainer/CuPy v7 only supports Python 3.5+.</p> <h2 id="shift-of-development-efforts-for-chainer">Shift of Development Efforts for Chainer</h2> <p>As <a href="https://preferred.jp/en/news/pr20191205/">announced today</a>, Preferred Networks, the company behind Chainer, is changing its primary framework to PyTorch. We expect that Chainer v7 will be the last major release for Chainer, and further development will be limited to bug-fixes and maintenance. The Chainer family products (ChainerCV, Chainer Chemistry, ChainerUI, and ChainerRL) will also follow this policy.</p> <p>CuPy will continue its development as before. 
Although developed as a GPU backend for Chainer, it has been widely adopted by different communities and is relatively unique in accelerating computation with GPUs using NumPy syntax.</p> <h3 id="background">Background</h3> <p>This decision has been made after serious considerations based on the mission of the Chainer team: <em>speeding up research and development of deep learning and its applications.</em> With the introduction of Chainer in 2015, we proposed an imperative API set for the differentiable programming paradigm that we named <em>define-by-run</em>. It is now often called <em>eager</em> execution. The define-by-run approach was originally motivated by structured networks for natural language processing such as recurrent neural networks (RNN) and brought advantages to other kinds of networks as well. Its intuitiveness and debuggability helped accelerate the deep learning research development cycle. We believed in the advantages of an imperative execution framework compared to the existing <em>define-and-run</em> declarative approaches. Along the way, we worked on improvements like object-oriented network definition, higher-order differentiation, dynamic inference of layer input size, and training loop abstractions, while keeping the simplicity of the pure Python implementation and interoperability with the NumPy ecosystem.</p> <p>The define-by-run approach has been widely adopted by the deep learning research community, and the designs of the major frameworks are converging to similar syntax and functionality. We are proud of the role that Chainer has played in this shift and pleased with its contribution to the community. We believe it is the right time to consider what contributions we should make to improve the research productivity of the deep learning community. Instead of separately developing frameworks with similar design goals, we have decided to support a framework with a larger user-base and ecosystem.</p> <p>After reviewing the available frameworks, we believe PyTorch is the closest in spirit to the Chainer style of code and the appropriate replacement. Preferred Networks will start using PyTorch widely, and we look forward to contributing to PyTorch with the experience and knowledge gained from the development of Chainer.</p> <h3 id="conclusion">Conclusion</h3> <p>For users migrating to PyTorch, we are releasing resources to ease porting efforts: <a href="http://chainer.github.io/migration-guide">Migration Guide</a> and <a href="http://github.com/chainer/chainer-pytorch-migration">Migration Library</a>.</p> <p>We would like to thank the contributors to the Chainer code base and the community surrounding it. We wouldn’t be here today without your support over all these years. 
Let’s continue improving deep learning software to accelerate research and development.</p> <p><a href="https://chainer.org/announcement/2019/12/05/released-v7-ja.html">日本語版 (Japanese)</a></p> Thu, 05 Dec 2019 00:00:00 +0000 https://chainer.org/announcement/2019/12/05/released-v7.html https://chainer.org/announcement/2019/12/05/released-v7.html Announcement Chainer/CuPy v7のリリースと今後の開発体制について <p>Chainer/CuPy v7のリリース、およびChainerの開発体制の変更についてお知らせします。</p> <h2 id="chainercupy-v7">Chainer/CuPy v7</h2> <p>本日、ChainerおよびCuPyのv7.0.0をリリースしました。変更点については各リリースノートをご覧ください。主要な変更点は以下の通りです。</p> <p>Chainer v7 (<a href="https://github.com/chainer/chainer/releases/tag/v7.0.0a1">alpha</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b1">beta1</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b2">beta2</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b3">beta3</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0b4">beta4</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0rc1">rc1</a>, <a href="https://github.com/chainer/chainer/releases/tag/v7.0.0">major</a>):</p> <ul> <li>ChainerMNを含む多くの機能がChainerXのndarrayに対応しました。</li> <li>ONNX-ChainerがChainerに統合されました。</li> <li><code class="language-plaintext highlighter-rouge">TabularDataset</code> が追加されました。カラム指向のデータセットをpandasのような抽象化APIで操作できます。</li> <li>NHWCのサポートが追加されました。Tensor Coreを搭載したGPUにおいて畳み込みやBatch Normalizationのパフォーマンスが向上します。</li> </ul> <p>CuPy v7 (<a href="https://github.com/cupy/cupy/releases/tag/v7.0.0a1">alpha</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b1">beta1</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b2">beta2</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b3">beta3</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0b4">beta4</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0rc1">rc1</a>, <a href="https://github.com/cupy/cupy/releases/tag/v7.0.0">major</a>):</p> <ul> <li>NVIDIA cuTENSORおよびCUBのサポートによりパフォーマンスが向上しました。</li> <li>ROCmの試験的なサポートを行いました。これにより、CuPyがAMD GPU上で実行可能になります。</li> </ul> <p>なお、すでに<a href="https://chainer.org/announcement/2019/08/21/python2.html">アナウンス</a>した通り、Python 2のサポートが終了しました。Chainer/CuPy v7ではPython 3.5以降のみがサポートされます。</p> <h2 id="chainer開発体制の変更について">Chainer開発体制の変更について</h2> <p>本日<a href="https://preferred.jp/ja/news/pr20191205/">アナウンス</a>された通り、Chainerの開発元であるPreferred Networksでは、研究開発に使用するフレームワークをPyTorchへ順次移行します。現時点では、Chainer v7はChainerの最後のメジャーリリースとなる予定であり、今後の開発はバグフィックスおよびメンテナンスのみとなります。Chainerファミリー(ChainerCV, Chainer Chemistry, ChainerUI, ChainerRL)についてもこの方針に従います。また、Preferred Networksの運用する<a href="https://tutorials.chainer.org/ja/">ディープラーニング入門: Chainerチュートリアル</a>については今後コンテンツのリニューアルを検討しています。</p> <p>なお、CuPyの開発はこれまで通り継続してゆきます。CuPyは当初ChainerのGPUバックエンドとして開発されましたが、現在ではGPUによる高速な演算をNumPyと同じ文法で記述できる数少ないライブラリとして、様々なコミュニティに受け入れられています。</p> <h3 id="背景">背景</h3> <p>この決定は、「深層学習およびその応用の研究開発を高速化する」というChainerチームのミッションを踏まえ、様々な検討を重ねた上で慎重に行われました。</p> <p>2015年に公開されたChainerは、微分可能プログラミングのための新たな命令的APIセットを提案し、それを <em>define-by-run</em> と名付けました。このパラダイムは、今日では <em>eager</em> executionとも呼ばれています。当初define-by-runのアプローチは、自然言語処理に用いられる回帰型ニューラルネットワーク(RNN)などの記述を容易にするというモチベーションから発案されたものでしたが、すぐにそれ以外のネットワークにも応用されてゆきました。その直感的な表記とデバッグの容易さは、深層学習研究における開発サイクルの高速化に大きく貢献しました。我々は命令的な実行方式を採用するフレームワークが、既存の宣言的な <em>define-and-run</em> 実行方式よりも優れているという確信を得て、開発を進めました。オブジェクト指向によるネットワーク定義、高次微分、レイヤの入力データサイズの動的推論、トレーニングループの抽象化といった様々な機能追加を、pure Pythonによる簡潔な実装とNumPyエコシステムとの相互運用性を保ったまま実現してきました。</p> 
<p>define-by-runのアプローチは深層学習コミュニティにおいて広く受け入れられ、結果として多くのフレームワークは似通った文法と機能に集約されてゆきました。Chainerチームは、このトレンドの転換においてChainerが果たした役割を誇りに思うとともに、コミュニティに対してこのような貢献ができたことを嬉しく思います。そして今、研究開発の生産性を高めるために深層学習コミュニティに対してどのような貢献をしてゆくべきか改めて熟慮した結果、似通ったゴールを持つフレームワークを個別に開発するのではなく、より大きなユーザベースとエコシステムを持つフレームワークに貢献してゆくことが最良であると判断しました。</p> <p>いくつかのフレームワークを検討したのち、PyTorchが最もChainerに近い思想を持っており、Chainerの後続として最適であると確信しました。Preferred Networksでは、今後PyTorchを主要なフレームワークとして使用するとともに、Chainerの開発を通じて得られた知識と経験を生かしてPyTorchへ貢献してゆきます。</p> <h3 id="おわりに">おわりに</h3> <p>PyTorchへの移行に際して、Chainerチームでは移行を容易にするためのドキュメントおよびライブラリを公開しました。</p> <ul> <li><a href="http://chainer.github.io/migration-guide">Migration Guide</a></li> <li><a href="http://github.com/chainer/chainer-pytorch-migration">Migration Library</a></li> </ul> <p>これまでChainerおよびChainerを取り巻くコミュニティへ貢献してくださった全ての皆さまに、深く感謝いたします。今日の成果は、皆さまの協力なくして成し得ませんでした。今後も深層学習ソフトウェアの改善を通じて、コミュニティと協働しながら深層学習領域の研究開発の加速に貢献してゆきたいと考えています。</p> <p><a href="https://chainer.org/announcement/2019/12/05/released-v7.html">英語版 (English)</a></p> Thu, 05 Dec 2019 00:00:00 +0000 https://chainer.org/announcement/2019/12/05/released-v7-ja.html https://chainer.org/announcement/2019/12/05/released-v7-ja.html Announcement Sunsetting Python 2 Support <p><strong>Summary:</strong> Due to the end-of-life (EOL) of Python 2 in January 2020, Chainer and CuPy v7.0.0b3 (release planned in August 2019) will drop Python 2 support. Chainer and CuPy v6.x (current stable release branch) continue to support Python 2. Chainer v6.x will be supported at least until after the EOL of Python 2.</p> <hr /> <p>The Chainer Team has decided to drop Python 2 support in Chainer and CuPy (referred to collectively as “Chainer” in this post) v7.x releases. This decision was made considering the following facts:</p> <ul> <li>Python 2 will become end-of-life (EOL) in <a href="https://www.python.org/dev/peps/pep-0373/#maintenance-releases">January 2020</a>.</li> <li>Many scientific computation packages, including NumPy, which is one of the core dependency of Chainer, are <a href="https://python3statement.org/">planning or already started to drop support for Python 2</a>.</li> <li>The results of open-source users’ survey held in the forum (<a href="https://groups.google.com/forum/#!topic/chainer/Yymm49chbC4">English</a> and <a href="https://groups.google.com/forum/#!topic/chainer-jp/b98cqvA9V9A">Japanese</a>) indicated that only a small ratio of users are currently using Python 2 and most of them are planning to migrate to Python 3.</li> <li>Supporting Python 2 and 3 in the same codebase requires effort, such as replicating Python 3 features in Python 2, including <code class="language-plaintext highlighter-rouge">six</code> in pull-request reviews, etc.</li> </ul> <p>We will sunset Python 2 support on the following schedule:</p> <ul> <li>In Chainer v7.0.0b3 (planned in August 2019), Python 2 will not be supported. It will still run on Python 2, but will give a warning when <code class="language-plaintext highlighter-rouge">import chainer</code> / <code class="language-plaintext highlighter-rouge">import cupy</code> commands are used in Python 2.</li> <li>In Chainer v7.0.0b4 (planned in September 2019), Python 2 support will be removed, and the code will not run on Python 2.</li> </ul> <p>Please note that Chainer v6.x (current stable) releases still support Python 2, so you can continue using current and future v6.x releases on Python 2 in your existing projects. 
Chainer v6.x will be supported at least until after the EOL of Python 2.</p> <h3 id="chainer-family-products">Chainer Family Products</h3> <ul> <li><strong>ChainerMN</strong> (which is already merged to Chainer), <strong><a href="https://github.com/chainer/chainercv">ChainerCV</a></strong>, <strong><a href="https://github.com/pfnet/chainer-chemistry">Chainer Chemistry</a></strong>, and <strong><a href="https://github.com/chainer/chainerui">ChainerUI</a></strong> will support Python 2 until EOL of Chainer v6.x series.</li> <li><strong><a href="https://github.com/chainer/chainerrl">ChainerRL</a></strong> will drop Python 2 support in the near future (possibly next release, before the Chainer v6.x EOL) as the latest <code class="language-plaintext highlighter-rouge">gym</code> package (which ChainerRL depends on) no longer supports Python 2.</li> <li><strong><a href="https://github.com/chainer/onnx-chainer">ONNX-Chainer</a></strong> and <strong><a href="https://github.com/chainer/chainerio">ChainerIO</a></strong> didn’t support Python 2 since the initial release.</li> </ul> Wed, 21 Aug 2019 00:00:00 +0000 https://chainer.org/announcement/2019/08/21/python2.html https://chainer.org/announcement/2019/08/21/python2.html Announcement Released Chainer/CuPy v6.0.0 <p>We have released Chainer and CuPy v6.0.0 today! This is a major release that introduces several new features. Full updates can be found in the release notes: <a href="https://github.com/chainer/chainer/releases/tag/v6.0.0">Chainer</a>, <a href="https://github.com/cupy/cupy/releases/tag/v6.0.0">CuPy</a>.</p> <h2 id="chainerx">ChainerX</h2> <p>The biggest update is the introduction of <strong>ChainerX</strong>. It is a fast and portable ndarray engine with autograd support written in C++ with a very thin Python wrapper.</p> <p>We have released the beta version of ChainerX in v6.0.0b1 as we wrote in the <a href="https://chainer.org/announcement/2018/12/03/chainerx.html">previous blog post</a>. Since then, we have been working on improving it in various aspects. In particular, ChainerX in v6.0.0 expands the coverage of various features since v6.0.0b1.</p> <ul> <li><strong>Wider op coverage</strong>. We have more Chainer functions that directly call ChainerX’s low-overhead implementation. The effort is still on going at <a href="https://github.com/chainer/chainer/issues/6423">the tracking issue</a> with <a href="https://docs.google.com/spreadsheets/d/1B4E78tw9Awgpcdn5G7zsQ8NVFYJdOoJlIQg42QxKNfU">the spreadsheet of op-wise implementation status</a>. We continue to expand the op coverage towards the next v7 release. Contributions are always welcomed!</li> <li><strong>Wider Function coverage</strong>. Most users will start using ChainerX through Chainer’s existing interface (just by replacing NumPy/CuPy arrays with ChainerX arrays). When ChainerX does not have an implementation for an operation, Chainer automatically falls back to NumPy/CuPy-based implementation. It basically works without any fix for most functions, but sometimes not. We are fixing such bugs to enlarge the coverage of functions for ChainerX usage. The effort is accompanied by the introduction of a test fixture class for function tests (you can find <a href="https://github.com/chainer/chainer/issues/6071">the tracking issue</a>). Currently, 40% of the functions under <code class="language-plaintext highlighter-rouge">chainer.functions</code> are already tested with ChainerX. 
They cover basic array operations resembling routines in NumPy and operations commonly used in convolutional neural networks such as convolution, deconvolution and pooling. Operations for recurrent neural networks will be addressed in the upcoming releases. We hope the coverage will reach 100% in v7. Contributions are always welcomed here, too!</li> <li><strong>Wider example coverage</strong>. Most examples now support ChainerX. By specifying ChainerX’s device names (e.g. <code class="language-plaintext highlighter-rouge">native</code> for CPU and <code class="language-plaintext highlighter-rouge">cuda:0</code>, <code class="language-plaintext highlighter-rouge">cuda:1</code>, … for GPUs), examples run with ChainerX arrays. It also means that the coverage of ChainerX support in Chainer’s features in general is expanding.</li> </ul> <p>You can find the <a href="https://chainer.org/announcement/2018/12/03/chainerx.html">previous blog post</a> for its background and overview, and <a href="https://docs.chainer.org/en/v6.0.0/chainerx/index.html">ChainerX Documentation</a> for the installation guide, tutorial, and reference.</p> <h2 id="other-updates">Other updates</h2> <p>This release also includes many features other than ChainerX. We list up notable updates as follows.</p> <ul> <li><strong>More mixed precision support</strong>. Chainer v6 introduces <em>mixed precision mode</em> and <em>dynamic loss scaling</em> for better support of mixed precision training. Mixed precision mode is enabled by setting <code class="language-plaintext highlighter-rouge">CHAINER_DTYPE=mixed16</code> or <code class="language-plaintext highlighter-rouge">chainer.global_config.dtype = chainer.mixed16</code>. In this mode, Chainer automatically chooses either <code class="language-plaintext highlighter-rouge">float16</code> or <code class="language-plaintext highlighter-rouge">float32</code> depending on what is appropriate in terms of a performance-to-precision tradeoff. Dynamic loss scaling, originated from <a href="https://github.com/NVIDIA/apex">Apex</a>, automatically adjusts the scaling coefficient of backprop to avoid underflow.</li> <li><strong>Device API</strong>. We introduce a new device API for better interoperability between backends (including ChainerX). It unifies the way in which devices are specified and data is transferred between devices. In particular, a unified device specifier is introduced. It is based on ChainerX’s device specifier of the format <code class="language-plaintext highlighter-rouge">'backend:id'</code>, e.g. <code class="language-plaintext highlighter-rouge">'native:0'</code> and <code class="language-plaintext highlighter-rouge">'cuda:N'</code> (where <code class="language-plaintext highlighter-rouge">N</code> is the CUDA device id). For native (CPU), the id part can be omitted (like <code class="language-plaintext highlighter-rouge">'native'</code>). For conventional devices backed by NumPy-like modules, the name is <code class="language-plaintext highlighter-rouge">@numpy</code>, <code class="language-plaintext highlighter-rouge">@cupy:N</code>, and <code class="language-plaintext highlighter-rouge">@intel64</code>. This notation can be used, e.g., in the <code class="language-plaintext highlighter-rouge">to_device</code> function. Note that the existing APIs related to devices (e.g. 
<code class="language-plaintext highlighter-rouge">to_cpu</code> and <code class="language-plaintext highlighter-rouge">to_gpu</code>) are still available.</li> <li><strong><code class="language-plaintext highlighter-rouge">__array_function__</code> in CuPy</strong>. NumPy’s <code class="language-plaintext highlighter-rouge">__array_function__</code> is an experimental feature for letting NumPy dispatch implementations of almost all functions to third-party duck arrays. CuPy now supports this interface. To use this feature, you need to get NumPy 1.16 and set <code class="language-plaintext highlighter-rouge">NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1</code> (it will hopefully be the default mode in NumPy 1.17). Then, many NumPy functions that CuPy supports will accept CuPy arrays and automatically call CuPy’s implementation.</li> </ul> <p>We recommend updating to the latest version of Chainer and CuPy. You can find the upgrade guide <a href="https://docs.chainer.org/en/latest/upgrade.html">here</a>. Updating Chainer should be done as usual with the command <code class="language-plaintext highlighter-rouge">pip install -U chainer</code>. Note that ChainerX is not built by default; see the <a href="https://docs.chainer.org/en/v6.0.0/chainerx/install/index.html">installation guide of ChainerX</a> for details. CuPy can be updated with <code class="language-plaintext highlighter-rouge">pip</code> as well, but be careful to use the appropriate package name if you are using a wheel package (<code class="language-plaintext highlighter-rouge">cupy-cuda NN</code>).</p> <p>Any feedback to the dev team would be welcomed and appreciated. You can ask questions or leave comments at <a href="https://gitter.im/chainer">gitter</a>, <a href="https://bit.ly/join-chainer-slack">Slack</a>, <a href="https://groups.google.com/forum/#!forum/chainer">Google Groups</a>, and <a href="https://stackoverflow.com/questions/tagged/chainer">StackOverflow</a>.</p> Thu, 16 May 2019 00:00:00 +0000 https://chainer.org/announcement/2019/05/16/released-v6.html https://chainer.org/announcement/2019/05/16/released-v6.html Announcement ChainerX Beta Release <p>Today, we announce <strong>ChainerX</strong>, a fast, portable, and extensible backend of Chainer. It is aimed at reducing the host-side performance overhead as well as making models much easier to ship for applications. ChainerX is included as an optional feature of Chainer v6.0.0 beta1, and is planned to be officially released as a part of Chainer v6 series next Spring. You can find <a href="https://docs.chainer.org/en/latest/chainerx/index.html">the official documentation</a>, including a quick tutorial.</p> <h2 id="background">Background</h2> <p>Chainer was developed as a pure Python package, which enabled a simple interface for a Define-by-Run deep learning framework. It heavily depends on NumPy and CuPy, which are both implemented in fast, compiled languages (C and Cython, respectively). Most heavy deep learning tasks work best on NVIDIA GPUs, and, thanks to its asynchronous computing architecture, the framework overhead has been hidden by the sequence of heavy GPU kernel executions. This enabled a deep network system based on Chainer to take the record for <a href="https://arxiv.org/abs/1711.04325">the fastest training of a large convolutional networks at the time</a>.</p> <p>The situation is changing. GPUs are evolving rapidly compared to CPUs, and more and more accelerator chips optimized for deep learning computation are available. 
As a result, the host side operations are becoming the bottleneck of many tasks, including computer vision, automatic speech recognition, and natural language processing. Some research outcomes are being supplied to application areas, which increases the demands of deploying deep learning models for products and services in reliable and portable ways.</p> <p>While pure Python is easier to work with and design, it incurs heavy host-side overhead, and dependency on CPython can be an obstacle to porting models to applications. We found that the design of the multi-dimensional array and the define-by-run automatic differentiation is mature, and radical design changes are not expected. ChainerX is designed from scratch as a C++ implementation of these mature components to solve both the performance and the portability issues.</p> <h2 id="overview">Overview</h2> <p>The “X” suffix of the name stands for three keywords that represent its aim.</p> <ul> <li>Accelerated: It implements an ndarray with autograd feature in C++, removing the host-side overhead related to automatic differentiation.</li> <li>Exportable: Thanks to the C++ implementation, it opens the door to porting models onto Python-free environments. Note that ChainerX itself does not include any features to actually port the models; yet the pure C++ ndarray implementation with autograd makes it much easier to introduce such a mechanism.</li> <li>Extensible: As noted above, there are increasing demands of supporting a wider range of computing environments. The new ndarray has a modular design that enables us to plug-in a computing backend that supports new devices.</li> </ul> <p>ChainerX currently covers the ndarray and automatic differentiation part of Chainer. The ndarray and the chainerx namespace follow NumPy-like APIs. The implementation is written in C++, while a thin Python binding layer is provided. We added built-in support of this new ndarray for existing Chainer APIs, including Variable, so that users can immediately start using ChainerX with only slight changes to the user code.</p> <p><img src="https://chainer.org/assets/chainerx-stack.png" alt="Software stack of Chainer with ChainerX. ChainerX and its backends roughly correspond to Chainer’s automatic differentiation implementations and the NumPy/CuPy layers." /></p> <p>ChainerX provides the following three levels of interfaces.</p> <ul> <li>C++ API (unstable yet): The fastest interface to use if you do not need Python. Backend plugins require this layer to cooperate with.</li> <li>Python API: A thin wrapper of the C++ API. It follows the NumPy API design, so users familiar with NumPy can quickly learn this API.</li> <li>Chainer on ChainerX: The existing Chainer API also supports the new ChainerX ndarray similarly to NumPy/CuPy ndarrays. 
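As a rough illustration, existing Chainer code can be fed ChainerX arrays as in the following sketch (illustrative only; device names such as <code class="language-plaintext highlighter-rouge">native:0</code> and <code class="language-plaintext highlighter-rouge">cuda:0</code> are the ones described in the ChainerX documentation): <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import chainer.functions as F
import chainer.links as L
import chainerx

# A ChainerX array on the native (CPU) device; 'cuda:0' would target the first GPU.
x = chainerx.ones((4, 3), dtype=chainerx.float32, device='native:0')

# An ordinary Chainer link, with its parameters moved to the same ChainerX device.
layer = L.Linear(3, 2)
layer.to_device('native:0')

# Existing Chainer code then runs unchanged; operations without a ChainerX kernel
# are expected to fall back to the NumPy/CuPy based implementation.
y = F.relu(layer(x))
print(y.shape)  # (4, 2)
</code></pre></div></div>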
It incurs some overhead, but it is the easiest way to start using ChainerX based on existing code base.</li> </ul> <p>The following is a quick comparison result of host-side overhead.</p> <table> <thead> <tr> <th>Framework</th> <th>Time per iteration (= fwd+bwd+update, msec)</th> </tr> </thead> <tbody> <tr> <td>Chainer on NumPy</td> <td>14.48</td> </tr> <tr> <td>Chainer on ChainerX</td> <td><strong>7.54</strong></td> </tr> <tr> <td>ChainerX Python</td> <td><strong>1.88</strong></td> </tr> <tr> <td>PyTorch</td> <td>2.45</td> </tr> </tbody> </table> <p>While ChainerX lacks some operation implementations (see the <a href="https://docs.chainer.org/en/latest/chainerx/limitations.html">limitations</a> page for more information), Chainer on ChainerX supports automatic fallback to existing NumPy/CuPy based code for forward, backward, and update. Note that this compatibility layer also has some overhead, and we will continue exploring the best way of getting the maximum performance with small code changes.</p> <h2 id="future">Future</h2> <p>During the beta phase of Chainer v6, we will add more features to ChainerX so that it can accelerate a wider range of research and applications. We will also continue exploring other ways to adopt the fast Python API of ChainerX in more research projects.</p> <p>ChainerX is just a half of the picture that covers the whole scenario of fast research and quick model shipping. The Chainer team is also working on translating the models written in Python to a portable format based on ONNX, and running the exported neural net with the ChainerX C++ implementation. We are looking forward to realizing this new research and application cycle, and seeing more and more researchers, practitioners, and engineers play with this evolving framework.</p> Mon, 03 Dec 2018 00:00:00 +0000 https://chainer.org/announcement/2018/12/03/chainerx.html https://chainer.org/announcement/2018/12/03/chainerx.html Announcement Released Chainer/CuPy v5.0.0 <p>We have released Chainer and CuPy v5.0.0 today! This is a major release that introduces several new features.</p> <p>The following is a list of selected updates. Full updates can be found in the release notes: <a href="https://github.com/chainer/chainer/releases/tag/v5.0.0">Chainer</a>, <a href="https://github.com/cupy/cupy/releases/tag/v5.0.0">CuPy</a>.</p> <ul> <li><strong>Static subgraph optimization</strong> <em>(experimental)</em>. By applying the <code class="language-plaintext highlighter-rouge">@static_graph</code> decorator to the static part of your computation (which uses the same graph at every iteration), the computational graph of that part is cached and reused. Fully-static models speed up by 20-60% in most cases. Example code modified for the static subgraph feature can be found <a href="https://github.com/chainer/chainer/tree/v5/examples/static_graph_optimizations">here</a>.</li> <li><strong>Float16 support</strong>. Using half-precision floats is made much easier! Since recent GPU technologies often focus on half and mixed precision computations, using float16 is crucial for fully utilizing the latest hardware performance. In Chainer v5, the default floating point dtype is configurable via <code class="language-plaintext highlighter-rouge">CHAINER_DTYPE</code> environment variable or <code class="language-plaintext highlighter-rouge">config.dtype</code> entry. Using this feature, most code will be able to use float16 without modification. 
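For example, a minimal sketch (assuming the v5 dtype configuration described here; treat it as illustrative rather than reference code) looks like this: <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

import chainer
import chainer.links as L

# Equivalent to running with CHAINER_DTYPE=float16 in the environment.
chainer.global_config.dtype = np.float16

# Links constructed afterwards should initialize their parameters in half precision.
layer = L.Linear(100, 10)
print(layer.W.dtype)  # float16
</code></pre></div></div>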
Many classes and functions are fixed to support float16 inputs and parameters.</li> <li><strong>ChainerMN integration</strong>. <a href="https://github.com/chainer/chainermn">ChainerMN</a> was an add-on package of Chainer for distributed deep learning, but is now a built-in module of Chainer v5. The APIs and the usage are not changed; just install <code class="language-plaintext highlighter-rouge">chainer</code> and <code class="language-plaintext highlighter-rouge">mpi4py</code> to start distributed deep learning.</li> <li><strong>Probability distributions</strong>. We introduced the <code class="language-plaintext highlighter-rouge">chainer.distributions</code> module that implements many parametric probability distributions with autograd capability. Each distribution provides point-wise evaluation (e.g. log density), statistics computation, and sampling. For its implementation, we also added many GPU sampling routines (under <code class="language-plaintext highlighter-rouge">cupy.random</code>) and special functions (e.g. log-gamma function). While v5 includes many frequently used distributions, we are still expanding this feature for the upcoming releases.</li> <li><strong>iDeep 2.0</strong>. Chainer Backend for Intel Architecture, a.k.a. iDeep, is updated. You can install it with <code class="language-plaintext highlighter-rouge">pip install ideep4py</code>, and use it by setting the environment variable <code class="language-plaintext highlighter-rouge">CHAINER_USE_IDEEP=auto</code>. There are many performance improvements in this version.</li> <li><strong>CuPy interoperability with other libraries and ecosystems</strong>. CuPy ndarray can now be easily combined with other libraries. For more details, see the <a href="https://docs-cupy.chainer.org/en/v5.0.0/reference/interoperability.html">Interoperability section</a> of the CuPy reference manual. <ul> <li>DLpack: <code class="language-plaintext highlighter-rouge">ndarray.toDLpack</code> and <code class="language-plaintext highlighter-rouge">cupy.fromDLpack</code> can be used to interchange the array with other deep learning frameworks.</li> <li>NumPy: NumPy ufunc is directly applicable to CuPy’s ndarray. For example, <code class="language-plaintext highlighter-rouge">numpy.exp(cupy.arange(3))</code> is valid, which is equivalent to <code class="language-plaintext highlighter-rouge">cupy.exp(cupy.arange(3))</code>.</li> <li>Numba: Numba’s JITed CUDA kernel is directly applicable to CuPy ndarrays.</li> </ul> </li> </ul> <p>We recommend updating to the latest version of Chainer and CuPy. You can find the upgrade guide <a href="https://docs.chainer.org/en/latest/upgrade.html">here</a>. Updating Chainer should be done as usual with the command <code class="language-plaintext highlighter-rouge">pip install -U chainer</code>. CuPy can be updated in the same way, but be careful to use the appropriate package name if you are using a wheel package (<code class="language-plaintext highlighter-rouge">cupy-cuda NN</code>).</p> <p>Any feedback to the dev team would be welcomed and appreciated. 
You can ask questions or leave comments at <a href="https://gitter.im/chainer">gitter</a>, <a href="https://bit.ly/join-chainer-slack">Slack</a>, <a href="https://groups.google.com/forum/#!forum/chainer">Google Groups</a>, and <a href="https://stackoverflow.com/questions/tagged/chainer">StackOverflow</a>.</p> Thu, 25 Oct 2018 00:00:00 +0000 https://chainer.org/general/2018/10/25/released-v5.html https://chainer.org/general/2018/10/25/released-v5.html Announcement ChainerMN on AWS with CloudFormation <p><em>Japanese version is <a href="https://research.preferred.jp/2018/06/chainermn-on-aws-with-cloudformation/">here</a></em></p> <p><a href="https://aws.amazon.com/cloudformation/">AWS CloudFormation</a> is a service that helps us practice <a href="https://en.wikipedia.org/wiki/Infrastructure_as_Code"><em>Infrastructure as Code</em></a> across a wide variety of AWS resources. <a href="https://aws.amazon.com/cloudformation/">AWS CloudFormation</a> provisions AWS resources in a repeatable manner and allows us to build and rebuild infrastructure without time-consuming manual actions or custom scripts.</p> <p>Building distributed deep learning infrastructure involves some extra hassle, such as installing and configuring deep learning libraries, setting up EC2 instances, and optimizing computational/network performance. In particular, running <a href="https://github.com/chainer/chainermn">ChainerMN</a> requires you to set up an MPI cluster. <a href="https://aws.amazon.com/cloudformation/">AWS CloudFormation</a> helps us automate this process.</p> <p>Today, we announce a <a href="https://github.com/chainer/chainer-ami">Chainer/ChainerMN pre-installed AMI</a> and a <a href="https://github.com/chainer/chainer-cfn">CloudFormation template for ChainerMN clusters</a>.</p> <ul> <li><a href="https://github.com/chainer/chainer-ami">chainer/chainer-ami</a></li> <li><a href="https://github.com/chainer/chainer-cfn">chainer/chainer-cfn</a></li> </ul> <p>These enable you to spin up a <a href="https://github.com/chainer/chainermn">ChainerMN</a> cluster on AWS and run your <a href="https://github.com/chainer/chainermn">ChainerMN</a> tasks in the cluster right away.</p> <p>This article explains how to use them and how you can run distributed deep learning with <a href="https://github.com/chainer/chainermn">ChainerMN</a> on AWS.</p> <h2 id="chainer-ami"><a href="https://github.com/chainer/chainer-ami">Chainer AMI</a></h2> <p>The <a href="https://github.com/chainer/chainer-ami">Chainer AMI</a> comes with <a href="https://chainer.org">Chainer</a>/<a href="https://cupy.chainer.org/">CuPy</a>/<a href="https://github.com/chainer/chainermn">ChainerMN</a>, their companion libraries (<a href="https://github.com/chainer/chainercv">ChainerCV</a> and <a href="https://github.com/chainer/chainerrl">ChainerRL</a>), and a <a href="https://developer.nvidia.com/cuda-zone">CUDA</a>-aware <a href="https://www.open-mpi.org/">OpenMPI</a> build, so that you can easily run <a href="https://chainer.org">Chainer</a>/<a href="https://github.com/chainer/chainermn">ChainerMN</a> workloads on AWS EC2 instances, including GPU instances. This image is based on the <a href="https://docs.aws.amazon.com/dlami/latest/devguide/overview-base.html">AWS Deep Learning Base AMI</a>.</p> <p>The latest version is <code class="language-plaintext highlighter-rouge">0.1.0</code>.
The version includes:</p> <ul> <li>OpenMPI version <code class="language-plaintext highlighter-rouge">2.1.3</code> <ul> <li>it was built only for <code class="language-plaintext highlighter-rouge">cuda-9.0</code>.</li> </ul> </li> <li>All Chainer Families (they are built and installed against both <code class="language-plaintext highlighter-rouge">python</code> and <code class="language-plaintext highlighter-rouge">python3</code> environment) <ul> <li><code class="language-plaintext highlighter-rouge">CuPy</code> version <code class="language-plaintext highlighter-rouge">4.1.0</code></li> <li><code class="language-plaintext highlighter-rouge">Chainer</code> version <code class="language-plaintext highlighter-rouge">4.1.0</code>,</li> <li><code class="language-plaintext highlighter-rouge">ChainerMN</code>, version <code class="language-plaintext highlighter-rouge">1.3.0</code></li> <li><code class="language-plaintext highlighter-rouge">ChainerCV</code> version <code class="language-plaintext highlighter-rouge">0.9.0</code></li> <li><code class="language-plaintext highlighter-rouge">ChainerRL</code> version <code class="language-plaintext highlighter-rouge">0.3.0</code></li> </ul> </li> </ul> <h2 id="cloudformation-template-for-chainermn"><a href="https://github.com/chainer/chainer-cfn">CloudFormation Template For ChainerMN</a></h2> <p>This template automatically sets up a <a href="https://github.com/chainer/chainermn">ChainerMN</a> cluster on AWS. Here’s the setup overview for AWS resources:</p> <ul> <li>VPC and Subnet for the cluster (you can configure existing VPC/Subnet)</li> <li>S3 Bucket for sharing ephemeral ssh-key, which is used to communicate among MPI processes in the cluster</li> <li>Placement group for optimizing network performance</li> <li>ChainerMN cluster which consists of: <ul> <li><code class="language-plaintext highlighter-rouge">1</code> master EC2 instance</li> <li><code class="language-plaintext highlighter-rouge">N (&gt;=0)</code> worker instances (via AutoScalingGroup)</li> <li><code class="language-plaintext highlighter-rouge">chainer</code> user to run mpi job in each instance</li> <li><code class="language-plaintext highlighter-rouge">hostfile</code> to run mpi job in each instance</li> </ul> </li> <li>(Option) <a href="https://aws.amazon.com/efs/features/">Amazon Elastic Filesystem</a> (you can configure an existing filesystem) <ul> <li>This is mounted on cluster instances automatically to share your code and data.</li> </ul> </li> <li>Several required SecurityGroups, IAM Role</li> </ul> <p>The latest version is <code class="language-plaintext highlighter-rouge">0.1.0</code>. 
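If you prefer scripting the stack creation over the console walkthrough described below, a minimal sketch with boto3 might look like the following (the parameter key shown is illustrative; check the template for the actual parameters):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import boto3

cfn = boto3.client('cloudformation')

cfn.create_stack(
    StackName='chainermn-sample',
    TemplateURL='https://s3-us-west-2.amazonaws.com/chainer-cfn/chainer-cfn-v0.1.0.template',
    # Illustrative parameter key; use the keys actually defined in the template.
    Parameters=[{'ParameterKey': 'KeyName', 'ParameterValue': 'my-keypair'}],
    # The template creates IAM roles, so the IAM capability has to be acknowledged,
    # just like the checkbox in the console flow described below.
    Capabilities=['CAPABILITY_IAM'],
)
</code></pre></div></div> <p>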
Please see <a href="https://s3-us-west-2.amazonaws.com/chainer-cfn/chainer-cfn-v0.1.0.template">the latest template</a> for detailed resource definitions.</p> <p>As stated in our <a href="https://chainer.org/general/2018/05/25/chainermn-v1-3.html">recent blog post on ChainerMN 1.3.0</a>, using the new features (double buffering and All-Reduce in half-precision floats) enables almost linear scalability on AWS even at Ethernet speeds.</p> <h2 id="how-to-build-a-chainermn-cluster-with-the-cloudformation-template">How to build a <a href="https://github.com/chainer/chainermn">ChainerMN</a> Cluster with the <a href="https://github.com/chainer/chainer-cfn">CloudFormation Template</a></h2> <p>This section explains how to set up a <a href="https://github.com/chainer/chainermn">ChainerMN</a> cluster on AWS in a step-by-step manner.</p> <p>First, click the link below to create an <a href="https://aws.amazon.com/cloudformation/">AWS CloudFormation</a> stack, and just click ‘Next’ on the page.</p> <p><a href="https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=chainermn-sample&amp;templateURL=https://s3-us-west-2.amazonaws.com/chainer-cfn/chainer-cfn-v0.1.0.template"><img src="https://s3.amazonaws.com/cloudformation-examples/cloudformation-launch-stack.png" alt="launch stack" /></a></p> <p>On the “Specify Details” page, you can configure the stack name and the VPC/subnet, cluster, and EFS parameters. The screenshot below is an example configuring <code class="language-plaintext highlighter-rouge">4</code> <code class="language-plaintext highlighter-rouge">p3.16xlarge</code> instances, each of which has 8 NVIDIA Tesla V100 GPUs.</p> <p><img src="/images/chainer-cfn-specifying-details.png" alt="chainer-cfn-specifying-details" /></p> <p>On the last confirmation page, you will need to check a box in the CAPABILITY section because this template creates some IAM roles for the cluster instances.</p> <p><img src="/images/chainer-cfn-capabilities-confirmation.png" alt="chainer-cfn-capabilities-confirmation" /></p> <p>After several minutes (depending on the cluster size), the status of the stack should converge to <code class="language-plaintext highlighter-rouge">CREATE_COMPLETE</code> if all went well, meaning your cluster is ready. You can access the cluster via <code class="language-plaintext highlighter-rouge">ClusterMasterPublicDNS</code>, which will appear in the output section of the stack.</p> <h2 id="how-to-run-chainermn-job-in-the-cluster">How to run a <a href="https://github.com/chainer/chainermn">ChainerMN</a> Job in the Cluster</h2> <p>You can access the cluster instances with the keypair that was specified in the template parameters.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh -i keypair.pem ubuntu@&lt;ClusterMasterPublicDNS&gt; </code></pre></div></div> <p>Because the <a href="https://github.com/chainer/chainer-ami">Chainer AMI</a> comes with all the libraries required to run <a href="https://chainer.org">Chainer</a>/<a href="https://github.com/chainer/chainermn">ChainerMN</a> jobs, you only need to download your code to the instances.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># switch user to chainer ubuntu@ip-ww-xxx-yy-zzz$ sudo su chainer # download ChainerMN's train_mnist.py into EFS chainer@ip-ww-xxx-yy-zzz$ wget https://raw.githubusercontent.com/chainer/chainermn/v1.3.0/examples/mnist/train_mnist.py -O /efs/train_mnist.py </code></pre></div></div> <p>That’s it!
Now, you can run MNIST example with <a href="https://github.com/chainer/chainermn">ChainerMN</a> by just invoking <code class="language-plaintext highlighter-rouge">mpiexec</code> command.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># It will spawn 32 processes(-n option) among 4 instances (8 processes per instance (-N option)) chainer@ip-ww-xxx-yy-zzz$ mpiexec -n 32 -N 8 python /efs/train_mnist.py -g ...(you will see ssh warning here) ========================================== Num process (COMM_WORLD): 32 Using GPUs Using hierarchical communicator Num unit: 1000 Num Minibatch-size: 100 Num epoch: 20 ========================================== epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time 1 0.795527 0.316611 0.765263 0.907536 4.47915 ... 19 0.00540187 0.0658256 0.999474 0.979351 14.7716 20 0.00463723 0.0668939 0.998889 0.978882 15.2248 # NOTE: above output is actually the output of the second try because mnist dataset download is needed in the first try. </code></pre></div></div> Fri, 01 Jun 2018 00:00:00 +0000 https://chainer.org/general/2018/06/01/chainermn-on-aws-with-cloudformation.html https://chainer.org/general/2018/06/01/chainermn-on-aws-with-cloudformation.html General Open source deep learning framework Chainer officially supported by Amazon Web Services <p>Chainer has worked with Amazon Web Services (AWS) to provide access to the Chainer deep learning framework as a listed choice across many of AWS applications. Chainer provides straightforward calculation of deep neural networks in Python. The combination with AWS leverages Chainer’s exceptional abilities in multi-GPU and multi-server scaling, as demonstrated when <a href="https://www.preferred-networks.jp/docs/imagenet_in_15min.pdf">PFN trained ResNet50 on ImageNet-1K using Chainer in 15 minutes</a>, four times faster than the previous record held by Facebook.</p> <p>Usage of multi-GPU and multi-server scaling allows researchers to leverage the ability of the cloud to provide computing resources on demand. Chainer’s unparalleled ability for parallel computing combined with AWS cloud resources available on demand enables researchers and engineers to minimize their cost while training complex deep learning models in a fraction of the time required on more limited hardware.</p> <p>Chainer is already available as part of the AWS Deep Learning Amazon Machine Image (AMI). This is further enhanced by Chainer’s recent release of a CloudFormation script, which enables easy deployment of multiple Chainer AMIs at a time. Chainer has been tested to provide 95% scaling efficiency up to 32 GPUs on AWS, which means training of a neural network can be done up to thirty times as fast.</p> <p>To simplify the process of pre-processing data, tuning hyperparameters, and deploying a neural network, Chainer is now supported on Amazon SageMaker. Amazon SageMaker is a fully-managed platform that enables developers and data scientists to quickly and easily build, train, and deploy machine learning models at any scale. Using Chainer on Sagemaker will provide speed increases from parallelization, in addition to the deployment benefits of SageMaker.</p> <p>In an additional announcement, AWS now supports Chainer on AWS Greengrass, the AWS service that lets you run local compute, messaging, data caching, sync, and ML inference capabilities for connected devices in a secure way. 
Combined with Amazon SageMaker, this allows access to the ease and speed of Chainer when training models on SageMaker and direct deployment on AWS Greengrass to IoT devices.</p> <p>The Chainer team is excited about these releases by AWS and looks forward to providing further advances as deep learning techniques continue to advance.</p> Fri, 01 Jun 2018 00:00:00 +0000 https://chainer.org/general/2018/06/01/chainer-officially-supported-by-aws.html https://chainer.org/general/2018/06/01/chainer-officially-supported-by-aws.html General New ChainerMN functions for improved performance in cloud environments and performance testing results on AWS <p>ChainerMN is a package that adds multi-node distributed learning functionality to Chainer. We have added the following two new functions to v1.2.0 and v1.3.0 of ChainerMN, which are intended to improve the performance on systems whose inter-node communication bandwidth is low.</p> <ul> <li>Double buffering to conceal communication time</li> <li>All-Reduce function in half-precision floats (FP16)</li> </ul> <p>It had previously been difficult to achieve high parallel performance in a system environment without a high-speed network because ChainerMN was developed assuming a supercomputer-like system with a high-speed network. With these newly-added functions, ChainerMN will be able to achieve high parallel performance even in the cloud and other common systems such as Amazon Web Services (AWS) as we presented at GTC2018.</p> <h2 id="background">Background</h2> <p>In data parallel distributed deep learning, the training time is typically dominated by the All-Reduce operation to calculate the sum gradients computed per node. We solved this issue on PFN’s 1,024 GPU supercomputer by utilizing high-speed InfiniBand interconnect, which is also used in supercomputers and Microsoft Azure, and also using the NVIDIA Collective Communications Library (NCCL) that enables fast execution of All-Reduce functions [1]. However, AWS and other commonly used systems have larger communication overhead because they do not have such a high-speed interconnect as InfiniBand. As a result of this, we could not make training faster simply by increasing the number of nodes in some cases. To solve these issues, we have added two function to ChainerMN v1.2.0 and v1.3.0: a double buffering function to conceal communication time and an All-Reduce function in FP16.</p> <h2 id="function-to-conceal-communication-time-by-double-buffering">Function to conceal communication time by double buffering</h2> <p>This function conceals the time it takes to communicate and shortens the overall computation time by having computation (forward, backward, and optimize) and communication (All-Reduce) processes overlapped. Normally in ChainerMN, one iteration consists of the four steps in the below diagram: forward, backward, All-Reduce, and optimize.</p> <p><img src="https://chainer.org/assets/chainermn_v1_3_chainermn_iter.jpg" alt="Iteration" /></p> <p>Using double buffering to conceal the communication time, the calculation and communication processes can overlap as in the below diagram.</p> <p><img src="https://chainer.org/assets/chainermn_v1_3_dbuf_iter.jpg" alt="Double buffering iteration" /></p> <p>In this case, the optimization process is performed using gradients in the previous iteration. This means it uses old gradients to optimize the model, possibly affecting accuracy. 
We have learned, however, that almost the same level of accuracy can be maintained when training on ImageNet as demonstrated in the experiment described later in this article. </p> <p>You can use this function just by making <code class="language-plaintext highlighter-rouge">double_buffering=True</code> when creating a multi-node optimizer as shown below.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">chainermn</span><span class="p">.</span><span class="n">create_multi_node_optimizer</span><span class="p">(</span><span class="n">optimizer</span><span class="p">,</span> <span class="n">comm</span><span class="p">,</span> <span class="n">double_buffering</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> </code></pre></div></div> <p>Currently, this function only supports the <code class="language-plaintext highlighter-rouge">pure_nccl</code> communicator.</p> <h2 id="all-reduce-function-in-fp16">All-Reduce function in FP16</h2> <p>ChainerMN v1.2.0 only supported All-Reduce in FP32 but v1.3.0 supports FP16 as well. This allows you to perform distributed training even for FP16 models using ChainerMN.  We can expect a significant reduction in All-Reduce time by using FP16 because the communication volume is halved in comparison with using FP32. In addition, now you can use FP16 for only All-Reduce and reduce the All-Reduce time, even if you used FP32 in computation. This is the technique we employed for training on ImageNet using 1,024 GPUs [1].</p> <p>For FP16 models, All-Reduce is carried out in FP16 without making any change. You can use different data types for computation and All-Reduce by putting <code class="language-plaintext highlighter-rouge">allreduce_grad_dtype='float16'</code> when creating a communicator as shown below.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">comm</span> <span class="o">=</span> <span class="n">chainermn</span><span class="p">.</span><span class="n">create_communicator</span><span class="p">(</span><span class="s">'pure_nccl'</span><span class="p">,</span> <span class="n">allreduce_grad_dtype</span><span class="o">=</span><span class="s">'float16'</span><span class="p">)</span> </code></pre></div></div> <p>This function only supports the <code class="language-plaintext highlighter-rouge">pure_nccl</code> communicator as of today, as double buffering does likewise.</p> <h2 id="results">Results</h2> <p>To demonstrate high parallel performance using the two new functions, we measured performance using image classification datasets on ImageNet. We used ResNet-50 as the CNN model. In this experiment, we used 10Gb Ethernet of PFN’s supercomputer MN-1 and AWS as low-speed networks. 
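Both of the new features were enabled together in this configuration; as a rough sketch (data loading and the training loop omitted, with a trivial stand-in model instead of the ResNet-50 used in the experiment), the relevant ChainerMN setup looks like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import chainer
import chainer.links as L
import chainermn

# FP16 All-Reduce: gradients are exchanged in half precision.
comm = chainermn.create_communicator('pure_nccl', allreduce_grad_dtype='float16')

# Trivial stand-in model; the experiment itself uses ResNet-50.
model = L.Classifier(L.Linear(None, 1000))
model.to_gpu(comm.intra_rank)  # one GPU per MPI process

# Double buffering: overlap the All-Reduce with computation.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(momentum=0.9), comm, double_buffering=True)
optimizer.setup(model)
</code></pre></div></div> <p>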
For more details on the experiment setting, please refer to the appendix at the end of this article.</p> <h3 id="evaluation-using-10gb-ethernet">Evaluation using 10Gb Ethernet</h3> <p>The following graphs show changes in the throughput as the number of GPUs is increased in three cases using MN-1: Infiniband FDR, 10Gb Ethernet, and 10Gb Ethernet using the two new functions.</p> <p><img src="https://chainer.org/assets/chainermn_v1_3_mn_1_result.png" alt="Throughput using 10 Gb Ethernet" /></p> <p>As you can see in the figure, the performance did not improve even as we increased the number of GPUs when using the 10Gb Ethernet while the use of the new functions enabled it to achieve the ideal speedup with the performance scaling linearly with the number of GPUs.</p> <p>The following table also shows the average validation accuracy and average training hours when conducting training for five times with the number of epochs = 90 and 32 GPUs.</p> <table> <thead> <tr> <th> </th> <th>Validation Accuracy (%)</th> <th>ComputingTime (hour)</th> </tr> </thead> <tbody> <tr> <td>InfiniBand FDR</td> <td>76.4</td> <td>6.96</td> </tr> <tr> <td>10 Gb Ethernet</td> <td>76.4</td> <td>21.3</td> </tr> <tr> <td>10 Gb Ethernet + Double Buffering + FP16 Allreduce</td> <td>75.8</td> <td>7.71</td> </tr> </tbody> </table> <p>As you can see, the two new functions had almost no impact on accuracy. In the meantime, it just took 11% longer to train the model when using the 10 Gb Ethernet and the new functions than when using Infiniband FDR. With this, we can conclude that high parallel performance can be achieved while maintaining the level of accuracy, without a need to use Infiniband or other high-speed networks.</p> <h3 id="evaluation-using-aws">Evaluation using AWS</h3> <p>In testing with AWS, we used p3.16xlarge. This instance has eight V100, which is the highest-performance GPU available as of May 2018. The following graphs show changes in the throughput as the number of GPUs increased when using this instance.</p> <p><img src="https://chainer.org/assets/chainermn_v1_3_aws_result.png" alt="Throughput using AWS" /></p> <p>Scaling efficiency is an indicator often used to measure parallel performance. In this experiment, the scaling efficiency is expressed as \(e\) using the following equation where the base throughput is \(t_0\) and the throughput when \(n\) x base GPUs are used is \(t\).</p> \[e = t/(t_0*n)\] <p>It indicates that the closer \(e\) gets to 1 (100%), the higher the parallel performance is. In this experiment, the scaling efficiency was 96% at 32GPUs when using 8GPUs as the base, demonstrating that the high parallel performance has been achieved by using the new functions.</p> <h2 id="outlook-for-the-future">Outlook for the future</h2> <p>We plan to add more functions to ChainerMN, including model parallelism to support various training models that are not achievable by data parallel as well as a function to improve fault tolerance. Our team is not only developing ChainerMN but also putting efforts into making Chainer and CuPy faster, and doing large-scale research and development activities by making the full use of MN-1, which is equipped with 1,024 P100 units, and the next-generation cluster with 512 V100 units.  
If you are interested in working with us on these activities, send us your application!</p> <h2 id="appendix">Appendix</h2> <h3 id="performance-measurement-details">Performance measurement details</h3> <h4 id="experiment-setup">Experiment setup</h4> <ul> <li>Dataset: ImageNet-1k</li> <li>Model: ResNet-50 (input image size 224×224)</li> </ul> <h4 id="setup-for-measuring-throughputs">Setup for measuring throughputs</h4> <ul> <li>Batch size: 64</li> <li>Learning rate: fixed</li> <li>Data augmentation: using the same method as Goyal et al. [2]</li> <li>Optimization: Momentum SGD (momentum=0.9)</li> <li>Weight decay: 0.0001</li> <li># of measurements: 400 iterations</li> </ul> <h4 id="setting-for-training-with--of-epochs--90">Setting for training with # of epochs = 90</h4> <ul> <li>Batch size: 64 per GPU until the 30th epoch, 128 afterwards</li> <li>Learning rate: gradual warmup until the 5th epoch, multiplied by 0.2 at the 30th epoch and by 0.1 at the 60th and 80th epochs</li> <li>Data augmentation: using the same method as Goyal et al. [2]</li> <li>Optimization: Momentum SGD (momentum=0.9)</li> <li>Weight decay: 0.0001</li> <li># of epochs: 90</li> <li>In general, this setup is based on Goyal et al. [2] and uses the technique described in Smith et al. [3].</li> </ul> <h4 id="experiment-conditions-in-the-verification-test-using-10gb-ethernet">Experiment conditions in the verification test using 10Gb Ethernet</h4> <ul> <li>Max 4 nodes, 32 GPUs in total</li> <li>Node <ul> <li>GPU: 8 * NVIDIA Tesla P100 GPUs</li> <li>CPU: 2 * Intel Xeon E5-2667 processors (3.20 GHz, 8 cores)</li> <li>Network: InfiniBand FDR</li> <li>Save location for training data: local disk</li> </ul> </li> </ul> <h4 id="experiment-conditions-in-the-verification-test-using-aws">Experiment conditions in the verification test using AWS</h4> <ul> <li>Max 4 nodes, 32 GPUs in total</li> <li>Node (p3.16xlarge) <ul> <li>GPU: 8 * NVIDIA Tesla V100 GPUs</li> <li>CPU: 64 vCPUs</li> <li>Network: 25 Gbps network</li> <li>Save location for training data: RAM disk</li> </ul> </li> </ul> <h2 id="references">References</h2> <p>[1] Akiba, T., et al. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. CoRR, abs/1711.04325, 2017.</p> <p>[2] Goyal, P., et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR, abs/1706.02677, 2017.</p> <p>[3] Smith, S. L., et al. Don’t Decay the Learning Rate, Increase the Batch Size. CoRR, abs/1711.00489, 2017.</p> Fri, 25 May 2018 00:00:00 +0000 https://chainer.org/general/2018/05/25/chainermn-v1-3.html https://chainer.org/general/2018/05/25/chainermn-v1-3.html General ChainerMN on Kubernetes with GPUs <p><a href="https://kubernetes.io/">Kubernetes</a> is today the most popular open-source system for automating deployment, scaling, and management of containerized applications. With the rise of <a href="https://kubernetes.io/">Kubernetes</a>, many companies are running <a href="https://kubernetes.io/">Kubernetes</a> as a platform for various workloads, including web applications, databases, cron jobs, and so on. Machine learning workloads, including deep learning workloads, are no exception, even though such workloads require special hardware like GPUs.</p> <p><a href="https://kubernetes.io/">Kubernetes</a> can <a href="https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/">schedule NVIDIA GPUs by default</a>, so single-node <a href="https://chainer.org/">Chainer</a> workloads are straightforward.
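For instance, a hedged sketch using the official Kubernetes Python client might look like the following (the image tag is the one used later in this article; the pod name and command are arbitrary examples):</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubernetes import client, config

config.load_kube_config()

# A single-node Chainer pod that asks the scheduler for one NVIDIA GPU.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name='chainer-gpu-example'),
    spec=client.V1PodSpec(
        restart_policy='Never',
        containers=[client.V1Container(
            name='chainer',
            image='chainer/chainer:v4.0.0-python3',
            command=['python3', '-c', 'import chainer; print(chainer.__version__)'],
            resources=client.V1ResourceRequirements(limits={'nvidia.com/gpu': '1'}),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace='default', body=pod)
</code></pre></div></div> <p>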
<p>However, running <a href="https://github.com/chainer/chainermn/">ChainerMN</a> on <a href="https://kubernetes.io/">Kubernetes</a> is not as straightforward, because it requires setting up an MPI cluster. <a href="https://github.com/kubeflow/kubeflow">Kubeflow</a> can be a big help here. The <a href="https://github.com/kubeflow/kubeflow">Kubeflow</a> project is dedicated to making deployments of machine learning (ML) workflows on <a href="https://kubernetes.io/">Kubernetes</a> simple, portable, and scalable. Please refer to the two helpful slides below about <a href="https://github.com/kubeflow/kubeflow">Kubeflow</a>, which were presented at <a href="https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2018/">KubeCon + CloudNativeCon Europe 2018</a>.</p> <ul> <li><a href="http://sched.co/Duoq">Keynote: Cloud Native ML on Kubernetes - David Aronchick, Product Manager, Cloud AI and Co-Founder of Kubeflow, Google &amp; Vishnu Kannan, Sr. Software Engineer, Google</a></li> <li><a href="http://sched.co/Drnd">Kubeflow Deep Dive – David Aronchick &amp; Jeremy Lewi, Google</a></li> </ul> <p>In this article, I would like to explain how to run <a href="https://github.com/chainer/chainermn/">ChainerMN</a> workloads on <a href="https://kubernetes.io/">Kubernetes</a> with the help of <a href="https://github.com/kubeflow/kubeflow">Kubeflow</a>.</p> <h2 id="how-to-run-chainermn-on-kubernetes">How to run ChainerMN on Kubernetes</h2> <p>I explain it in the three steps below:</p> <ul> <li><a href="#step-1-build-your-container-image">Step 1. Build Your Container Image</a></li> <li><a href="#step-2-install-kubeflows-openmpi-package">Step 2. Install Kubeflow’s OpenMPI package</a></li> <li><a href="#step-3-run-chainermn">Step 3. Run ChainerMN on Kubernetes</a></li> </ul> <h3 id="prerequisites">Prerequisites</h3> <ul> <li>a <a href="https://kubernetes.io/">Kubernetes</a> cluster equipped with NVIDIA GPUs</li> <li>on your local machine <ul> <li><a href="https://www.docker.com/community-edition">docker</a></li> <li><a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/">kubectl</a></li> <li><a href="https://ksonnet.io/">ksonnet</a></li> </ul> </li> </ul> <h3 id="step-1-build-your-container-image">Step 1. Build Your Container Image</h3> <p>First, we need to build a container image to run your deep learning workload with ChainerMN. We can simply follow <a href="http://chainermn.readthedocs.io/en/stable/installation/index.html">the official ChainerMN installation guide</a>.</p> <p>For <a href="https://chainer.org/">Chainer</a>/<a href="https://cupy.chainer.org/">CuPy</a>, the official Docker image <a href="https://hub.docker.com/r/chainer/chainer/"><code class="language-plaintext highlighter-rouge">chainer/chainer</code></a> is available on DockerHub. 
This is very handy as a base image or runtime image for deep learning workloads because this image is already <code class="language-plaintext highlighter-rouge">nvidia-docker</code> ready.</p> <p>Below is a sample <code class="language-plaintext highlighter-rouge">Dockerfile</code> to install CUDA aware <a href="https://www.open-mpi.org/">OpenMPI</a>, <a href="https://github.com/chainer/chainermn">ChainerMN</a> and its sample <code class="language-plaintext highlighter-rouge">train_mnist.py</code> script. Please save the contents with the name <code class="language-plaintext highlighter-rouge">Dockerfile</code>.</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> chainer/chainer:v4.0.0-python3</span> <span class="k">ARG</span><span class="s"> OPENMPI_VERSION="2.1.3"</span> <span class="k">ARG</span><span class="s"> CHAINER_MN_VERSION="1.2.0"</span> <span class="c"># Install basic dependencies and locales</span> <span class="k">RUN </span>apt-get update <span class="o">&amp;&amp;</span> apt-get <span class="nb">install</span> <span class="nt">-yq</span> <span class="nt">--no-install-recommends</span> <span class="se">\ </span> locales wget <span class="nb">sudo </span>ca-certificates ssh build-essential <span class="o">&amp;&amp;</span> <span class="se">\ </span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span> /var/cache/apt/archives/<span class="k">*</span> <span class="o">&amp;&amp;</span> <span class="se">\ </span> <span class="nb">echo</span> <span class="s2">"en_US.UTF-8 UTF-8"</span> <span class="o">&gt;</span> /etc/locale.gen <span class="o">&amp;&amp;</span> locale-gen <span class="c"># Install OpenMPI with cuda</span> <span class="k">RUN </span><span class="nb">cd</span> /tmp <span class="o">&amp;&amp;</span> <span class="se">\ </span> wget <span class="nt">-q</span> https://www.open-mpi.org/software/ompi/v<span class="k">${</span><span class="nv">OPENMPI_VERSION</span><span class="p">%\.*</span><span class="k">}</span>/downloads/openmpi-<span class="nv">$OPENMPI_VERSION</span>.tar.bz2 <span class="o">&amp;&amp;</span> <span class="se">\ </span> <span class="nb">tar</span> <span class="nt">-xjf</span> openmpi-<span class="nv">$OPENMPI_VERSION</span>.tar.bz2 <span class="o">&amp;&amp;</span> <span class="se">\ </span> <span class="nb">cd</span> /tmp/openmpi-<span class="nv">$OPENMPI_VERSION</span> <span class="o">&amp;&amp;</span> <span class="se">\ </span> ./configure <span class="nt">--prefix</span><span class="o">=</span>/usr <span class="nt">--with-cuda</span> <span class="o">&amp;&amp;</span> make <span class="nt">-j2</span> <span class="o">&amp;&amp;</span> make <span class="nb">install</span> <span class="o">&amp;&amp;</span> <span class="nb">rm</span> <span class="nt">-r</span> /tmp/openmpi-<span class="nv">$OPENMPI_VERSION</span><span class="k">*</span> <span class="o">&amp;&amp;</span> <span class="se">\ </span> ompi_info <span class="nt">--parsable</span> <span class="nt">--all</span> | <span class="nb">grep</span> <span class="nt">-q</span> <span class="s2">"mpi_built_with_cuda_support:value:true"</span> <span class="c"># Install ChainerMN</span> <span class="k">RUN </span>pip3 <span class="nb">install </span><span class="nv">chainermn</span><span class="o">==</span><span class="nv">$CHAINER_MN_VERSION</span> <span class="c"># Download train_mnist.py example of ChainerMN</span> <span class="c"># In practice, you would download 
your codes here.</span> <span class="k">RUN </span><span class="nb">mkdir</span> <span class="nt">-p</span> /chainermn-examples/mnist <span class="o">&amp;&amp;</span> <span class="se">\ </span> <span class="nb">cd</span> /chainermn-examples/mnist <span class="o">&amp;&amp;</span> <span class="se">\ </span> wget https://raw.githubusercontent.com/chainer/chainermn/v<span class="k">${</span><span class="nv">CHAINER_MN_VERSION</span><span class="k">}</span>/examples/mnist/train_mnist.py </code></pre></div></div> <p>Then, you are ready to build and push your container image.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This takes some time (probably 10-15 min.), so please enjoy ☕️.</span> docker build <span class="nb">.</span> <span class="nt">-t</span> YOUR_IMAGE_HERE docker push YOUR_IMAGE_HERE </code></pre></div></div> <h3 id="step-2-install-kubeflows-openmpi-package">Step 2. Install Kubeflow’s OpenMPI package</h3> <p><a href="https://github.com/kubeflow/kubeflow/tree/master/kubeflow/openmpi/">Kubeflow’s OpenMPI package</a> in <a href="https://github.com/kubeflow/kubeflow">Kubeflow</a> enables us to launch an <a href="https://www.open-mpi.org/">OpenMPI</a> cluster on <a href="https://kubernetes.io/">Kubernetes</a> very easily.</p> <p>Note that <strong><a href="https://github.com/kubeflow/kubeflow/blob/master/kubeflow/openmpi">Kubeflow’s OpenMPI package</a> has not been released officially yet</strong>, but it is already available in the <code class="language-plaintext highlighter-rouge">master</code> branch of the <a href="https://github.com/kubeflow/kubeflow">Kubeflow</a> repository, so let’s use it. Please note that this package is still under development.</p> <p>Kubeflow depends on <a href="https://ksonnet.io/">ksonnet</a>. If you’re not familiar with <a href="https://ksonnet.io/">ksonnet</a>, I recommend following <a href="https://ksonnet.io/docs/tutorial">their official tutorial</a>.</p> <p>The steps are very similar to those described in <a href="https://github.com/kubeflow/kubeflow/blob/master/kubeflow/openmpi/">Kubeflow’s OpenMPI package</a>. 
I modified the original steps slightly because we have to use a specific commit of <a href="https://github.com/kubeflow/kubeflow">Kubeflow</a> repository.</p> <p><em>NOTE: If you faced <a href="https://developer.github.com/v3/#rate-limiting">rate limit errors</a> of github api, please set up <code class="language-plaintext highlighter-rouge">GITHUB_TOKEN</code> as described <a href="https://github.com/kubeflow/kubeflow#github-tokens">here</a>.</em></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Create a namespace for kubeflow deployment.</span> <span class="nv">NAMESPACE</span><span class="o">=</span>kubeflow kubectl create namespace <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="c"># Generate one-time ssh keys used by Open MPI.</span> <span class="nv">SECRET</span><span class="o">=</span>openmpi-secret <span class="nb">mkdir</span> <span class="nt">-p</span> .tmp <span class="nb">yes</span> | ssh-keygen <span class="nt">-N</span> <span class="s2">""</span> <span class="nt">-f</span> .tmp/id_rsa kubectl delete secret <span class="k">${</span><span class="nv">SECRET</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="o">||</span> <span class="nb">true </span>kubectl create secret generic <span class="k">${</span><span class="nv">SECRET</span><span class="k">}</span> <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">--from-file</span><span class="o">=</span><span class="nv">id_rsa</span><span class="o">=</span>.tmp/id_rsa <span class="nt">--from-file</span><span class="o">=</span>id_rsa.pub<span class="o">=</span>.tmp/id_rsa.pub <span class="nt">--from-file</span><span class="o">=</span><span class="nv">authorized_keys</span><span class="o">=</span>.tmp/id_rsa.pub <span class="c"># Which version of Kubeflow to use.</span> <span class="c"># For a list of releases refer to:</span> <span class="c"># https://github.com/kubeflow/kubeflow/releases</span> <span class="c"># (Specific commit hash is specified here.)</span> <span class="nv">VERSION</span><span class="o">=</span>e2fbf9e25e087eeb6ee1f9414526c6ed917c4bf9 <span class="c"># Initialize a ksonnet app. Set the namespace for it's default environment.</span> <span class="nv">APP_NAME</span><span class="o">=</span>chainermn-example ks init <span class="k">${</span><span class="nv">APP_NAME</span><span class="k">}</span> <span class="nb">cd</span> <span class="k">${</span><span class="nv">APP_NAME</span><span class="k">}</span> ks <span class="nb">env set </span>default <span class="nt">--namespace</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="c"># Install Kubeflow components.</span> ks registry add kubeflow github.com/kubeflow/kubeflow/tree/<span class="k">${</span><span class="nv">VERSION</span><span class="k">}</span>/kubeflow ks pkg <span class="nb">install </span>kubeflow/openmpi@<span class="k">${</span><span class="nv">VERSION</span><span class="k">}</span> </code></pre></div></div> <h3 id="step-3-run-chainermn">Step 3. Run ChainerMN!</h3> <p>Now ready to run distributed <code class="language-plaintext highlighter-rouge">train_mnist.py</code>! 
Following the standard <a href="https://ksonnet.io/">ksonnet</a> workflow, we first generate a <em><code class="language-plaintext highlighter-rouge">train_mnist</code> component</em> from the <em><code class="language-plaintext highlighter-rouge">openmpi</code> prototype</em>.</p> <p>When generating a component, we can specify several <em>parameters</em>. In this example, we specify:</p> <ul> <li><code class="language-plaintext highlighter-rouge">train-mnist</code> as its name,</li> <li><code class="language-plaintext highlighter-rouge">4</code> workers,</li> <li><code class="language-plaintext highlighter-rouge">1</code> GPU for each worker, and</li> <li>an <code class="language-plaintext highlighter-rouge">mpiexec ... train_mnist.py</code> command line for the <code class="language-plaintext highlighter-rouge">exec</code> param.</li> </ul> <p>Then, the <code class="language-plaintext highlighter-rouge">ks apply</code> command deploys our <a href="https://www.open-mpi.org/">OpenMPI</a> cluster on the <a href="https://kubernetes.io/">Kubernetes</a> cluster.</p> <p><em>Please be advised that this step requires authorization to create service accounts and cluster role bindings for the “view” cluster role. If you do not have such authorization, ask your administrator to create a service account that is granted the ‘get’ verb for ‘pods’ resources, and set it in the <code class="language-plaintext highlighter-rouge">serviceAccountName</code> param of the <code class="language-plaintext highlighter-rouge">train-mnist</code> component.</em></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># See the list of supported parameters.</span> ks prototype describe openmpi <span class="c"># Generate openmpi components.</span> <span class="nv">COMPONENT</span><span class="o">=</span>train-mnist <span class="nv">IMAGE</span><span class="o">=</span>YOUR_IMAGE_HERE <span class="nv">WORKERS</span><span class="o">=</span>4 <span class="nv">GPU</span><span class="o">=</span>1 <span class="nv">EXEC</span><span class="o">=</span><span class="s2">"mpiexec -n </span><span class="k">${</span><span class="nv">WORKERS</span><span class="k">}</span><span class="s2"> --hostfile /kubeflow/openmpi/assets/hostfile --allow-run-as-root --display-map -- python3 /chainermn-examples/mnist/train_mnist.py -g"</span> ks generate openmpi <span class="k">${</span><span class="nv">COMPONENT</span><span class="k">}</span> <span class="nt">--image</span> <span class="k">${</span><span class="nv">IMAGE</span><span class="k">}</span> <span class="nt">--secret</span> <span class="k">${</span><span class="nv">SECRET</span><span class="k">}</span> <span class="nt">--workers</span> <span class="k">${</span><span class="nv">WORKERS</span><span class="k">}</span> <span class="nt">--gpu</span> <span class="k">${</span><span class="nv">GPU</span><span class="k">}</span> <span class="nt">--exec</span> <span class="s2">"</span><span class="k">${</span><span class="nv">EXEC</span><span class="k">}</span><span class="s2">"</span> <span class="c"># Deploy to your cluster.</span> ks apply default <span class="c"># To clean up, execute the two commands below:</span> <span class="c"># ks delete default</span> <span class="c"># kubectl delete secret ${SECRET}</span> </code></pre></div></div> <p>This launches <code class="language-plaintext highlighter-rouge">1</code> master pod and <code class="language-plaintext highlighter-rouge">4</code> worker pods, plus some supplemental resources.</p>
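<p>Before looking at the logs, it may help to see what the script launched by <code class="language-plaintext highlighter-rouge">mpiexec</code> does on each worker. Below is a simplified sketch of the ChainerMN-specific parts of <code class="language-plaintext highlighter-rouge">train_mnist.py</code> (the real example in the ChainerMN repository has more options and the full training loop); when you swap in your own code, these are the pieces to add.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import chainer
import chainer.functions as F
import chainer.links as L
import chainermn


class MLP(chainer.Chain):
    """Small multilayer perceptron, as in the MNIST example."""

    def __init__(self, n_units, n_out):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_units)
            self.l2 = L.Linear(None, n_units)
            self.l3 = L.Linear(None, n_out)

    def __call__(self, x):
        h = F.relu(self.l1(x))
        h = F.relu(self.l2(h))
        return self.l3(h)


# Each MPI process creates a communicator; the log below reports
# "Using hierarchical communicator".
comm = chainermn.create_communicator('hierarchical')

# One GPU per process, chosen by the rank within the node.
device = comm.intra_rank
chainer.cuda.get_device_from_id(device).use()
model = L.Classifier(MLP(1000, 10))
model.to_gpu()

# Wrap the optimizer so that gradients are allreduced across workers.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.Adam(), comm)
optimizer.setup(model)

# Rank 0 loads MNIST and scatters shards to the other workers.
train, test = chainer.datasets.get_mnist() if comm.rank == 0 else (None, None)
train = chainermn.scatter_dataset(train, comm, shuffle=True)
test = chainermn.scatter_dataset(test, comm)
</code></pre></div></div>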
<p>Once the <code class="language-plaintext highlighter-rouge">train-mnist-master</code> pod reaches the <code class="language-plaintext highlighter-rouge">Running</code> state, you can see the training logs.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Inspect pod status</span> <span class="c"># Wait until all pods are 'Running'</span> kubectl get pod <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">-o</span> wide </code></pre></div></div> <p>If all went well, the job progress will appear on your terminal via <code class="language-plaintext highlighter-rouge">kubectl logs</code>, showing that the deep learning job is distributed across <code class="language-plaintext highlighter-rouge">4</code> workers.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Inspect training logs</span> kubectl logs <span class="nt">-n</span> <span class="k">${</span><span class="nv">NAMESPACE</span><span class="k">}</span> <span class="nt">-f</span> <span class="k">${</span><span class="nv">COMPONENT</span><span class="k">}</span><span class="nt">-master</span> </code></pre></div></div> <p>This will show the training logs (I omitted several warning messages that you can ignore):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
======================== JOB MAP ========================

Data for node: train-mnist-worker-0.train-mnist.kubeflow Num slots: 16 Max slots: 0 Num procs: 1
        Process OMPI jobid: [13015,1] App: 0 Process rank: 0 Bound: N/A

Data for node: train-mnist-worker-1.train-mnist.kubeflow Num slots: 16 Max slots: 0 Num procs: 1
        Process OMPI jobid: [13015,1] App: 0 Process rank: 1 Bound: N/A

Data for node: train-mnist-worker-2.train-mnist.kubeflow Num slots: 16 Max slots: 0 Num procs: 1
        Process OMPI jobid: [13015,1] App: 0 Process rank: 2 Bound: N/A

Data for node: train-mnist-worker-3.train-mnist.kubeflow Num slots: 16 Max slots: 0 Num procs: 1
        Process OMPI jobid: [13015,1] App: 0 Process rank: 3 Bound: N/A

=============================================================
==========================================
Num process (COMM_WORLD): 4
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
1           0.285947    0.106961              0.917333       0.9681                    16.6241
2           0.0870434   0.0882483             0.9736         0.9708                    23.0874
3           0.050553    0.0709311             0.9842         0.9781                    28.6014
...
</code></pre></div></div> Thu, 10 May 2018 00:00:00 +0000 https://chainer.org/general/2018/05/10/chainermn-on-kubernetes.html https://chainer.org/general/2018/05/10/chainermn-on-kubernetes.html General