This blogpost was delivered in talk form at the recent PASC 2019 conference. Slides for that talk are here.
We’re improving the state of scalable GPU computing in Python.
This post lays out the current status, and describes future work. It also summarizes and links to several other blogposts from recent months that drill down into different topics for the interested reader.
Broadly, we briefly cover the following categories:

- Python libraries accelerated with CUDA, like CuPy and RAPIDS
- Python-CUDA compilers, specifically Numba
- Scaling these libraries out with Dask
- Network communication with UCX
- Packaging with Conda
Probably the easiest way for a Python programmer to get access to GPU performance is to use a GPU-accelerated Python library. These provide a set of common operations that are well tuned and integrate well together.
Many users know libraries for deep learning like PyTorch and TensorFlow, but there are several others for more general purpose computing. These tend to copy the APIs of popular Python projects:

- CuPy mimics NumPy
- RAPIDS cuDF mimics Pandas
- RAPIDS cuML mimics Scikit-Learn
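For example, a CuPy computation often reads exactly like its NumPy counterpart (a small illustrative snippet, not from the original post):

import numpy as np
import cupy as cp

x_cpu = np.random.random((5000, 5000))
x_gpu = cp.random.random((5000, 5000))

# Same API, different hardware
y_cpu = np.linalg.norm(x_cpu, axis=1)  # runs on the CPU
y_gpu = cp.linalg.norm(x_gpu, axis=1)  # runs on the GPU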
These libraries build GPU accelerated variants of popular Python libraries like NumPy, Pandas, and Scikit-Learn. In order to better understand the relative performance differences, Peter Entschev recently put together a benchmark suite to help with comparisons. He has produced the following image showing the relative speedup between GPU and CPU:
There are lots of interesting results there. Peter goes into more depth on these in his blogpost.
More broadly though, we see that there is variability in performance. Our mental model for what is fast and slow on the CPU doesn’t necessarily carry over to the GPU. Fortunately though, thanks to consistent APIs, users that are familiar with Python can easily experiment with GPU acceleration without learning CUDA.
See also this recent blogpost about Numba stencils and the attached GPU notebook.
The built-in operations in GPU libraries like CuPy and RAPIDS cover most common operations. However, in real-world settings we often find messy situations that require writing a little bit of custom code. Switching down to C/C++/CUDA in these cases can be challenging, especially for users that are primarily Python developers. This is where Numba can come in.
Python has this same problem on the CPU as well. Users often couldn’t be bothered to learn C/C++ to write fast custom code. To address this there are tools like Cython or Numba, which let Python programmers write fast numeric code without learning much beyond the Python language.
For example, Numba accelerates the for-loop style code below about 500x on the CPU, from slow Python speeds up to fast C/Fortran speeds.
import numba   # We added these two lines for a 500x speedup
@numba.jit     # We added these two lines for a 500x speedup
def sum(x):
    total = 0
    for i in range(x.shape[0]):
        total += x[i]
    return total
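You can see the effect yourself by timing the first and second calls to the decorated function above (a hypothetical session; exact numbers depend on your machine):

import time
import numpy as np

x = np.random.random(100_000_000)

sum(x)  # the first call includes one-time compilation cost

start = time.time()
sum(x)  # subsequent calls run at compiled speed
print("compiled sum took", time.time() - start, "seconds")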
The ability to drop down to low-level performant code without context switching out of Python is useful, particularly if you don’t already know C/C++ or have a compiler chain set up for you (which is the case for most Python users today).
This benefit is even more pronounced on the GPU. While many Python programmers know a little bit of C, very few of them know CUDA. Even if they did, they would probably have difficulty in setting up the compiler tools and development environment.
Enter numba.cuda.jit, Numba’s backend for CUDA. Numba.cuda.jit allows Python users to author, compile, and run CUDA code, written in Python, interactively without leaving a Python session. Here is an image of writing a stencil computation that smooths a 2d-image, all from within a Jupyter Notebook:
Here is a simplified comparison of Numba CPU and GPU code to show the programming style. The GPU code gets a 200x speed improvement over a single CPU core.
import numba
import numpy as np

@numba.jit
def _smooth(x):
    out = np.empty_like(x)
    for i in range(1, x.shape[0] - 1):
        for j in range(1, x.shape[1] - 1):
            out[i, j] = (x[i + -1, j + -1] + x[i + -1, j + 0] + x[i + -1, j + 1] +
                         x[i +  0, j + -1] + x[i +  0, j + 0] + x[i +  0, j + 1] +
                         x[i +  1, j + -1] + x[i +  1, j + 0] + x[i +  1, j + 1]) // 9
    return out
or if we use the fancy numba.stencil decorator …
@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9
import numba.cuda
from numba import cuda  # for cuda.grid inside the kernel

@numba.cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9
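Note that, unlike the CPU versions, smooth_gpu is not called directly: it is launched over a grid of CUDA threads. A minimal launch might look like the following (a sketch; the 16x16 block shape is an arbitrary but typical choice):

import math
import numpy as np
from numba import cuda

x = np.random.random((10_000, 10_000))

x_gpu = cuda.to_device(x)                # copy the input to the GPU
out_gpu = cuda.device_array_like(x_gpu)  # allocate the output on the GPU

threads_per_block = (16, 16)
blocks_per_grid = (math.ceil(x.shape[0] / 16),
                   math.ceil(x.shape[1] / 16))

smooth_gpu[blocks_per_grid, threads_per_block](x_gpu, out_gpu)
out = out_gpu.copy_to_host()             # copy the result back to the CPU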
Numba.cuda.jit has been out in the wild for years. It’s accessible, mature, and fun to play with. If you have a machine with a GPU in it and some curiosity, then we strongly recommend that you try it out.
conda install numba
# or
pip install numba
>>> import numba.cuda
As mentioned in previous blogposts (1, 2, 3, 4), we’ve been generalizing Dask to operate not just with NumPy arrays and Pandas dataframes, but with anything that looks enough like NumPy (like CuPy or Sparse or Jax) or enough like Pandas (like RAPIDS cuDF), to scale those libraries out too. This is working out well. Here is a brief video showing Dask array computing an SVD in parallel, and seeing what happens when we swap out the NumPy library for CuPy.
We see that there is about a 10x speed improvement on the computation. Most importantly, we were able to switch between a CPU implementation and a GPU implementation with a small one-line change, while continuing to use the sophisticated algorithms within Dask Array, like its parallel SVD implementation.
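The one-line change in question looks roughly like this (a sketch of the pattern from the linked blogposts, not the exact code in the video):

import cupy
import dask.array as da

# This RandomState argument is the one-line change:
# drop it to get NumPy-backed chunks instead of CuPy-backed ones
rs = da.random.RandomState(RandomState=cupy.random.RandomState)

x = rs.random((1_000_000, 1_000), chunks=(10_000, 1_000))
u, s, v = da.linalg.svd(x)  # the same parallel SVD algorithm, now on GPU chunks
s.compute()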
We also saw a relative slowdown in communication. In general, almost all non-trivial Dask + GPU work today is becoming communication-bound. We’ve gotten fast enough at computation that the relative importance of communication has grown significantly. We’re working to resolve this with our next topic, UCX.
See this talk by Akshay Venkatesh or view the slides.
Also see this recent blogpost about UCX and Dask.
We’ve been integrating the OpenUCX library into Python with UCX-Py. UCX provides uniform access to transports like TCP, InfiniBand, shared memory, and NVLink. UCX-Py is the first time that many of these transports have been easily accessible from the Python language.
Using UCX and Dask together we’re able to get significant speedups. Here is a trace of the SVD computation from before, shown both before and after adding UCX:
Before UCX:
After UCX:
There is still a great deal to do here though (the blogpost linked above hasseveral items in the Future Work section).
People can try out UCX and UCX-Py with highly experimental conda packages:
conda create -n ucx -c conda-forge -c jakirkham/label/ucx cudatoolkit=9.2 ucx-proc=*=gpu ucx ucx-py python=3.7
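Once installed, one way to try UCX with Dask is to address the scheduler over the ucx:// protocol (a hypothetical sketch; the address and flags depend on your deployment):

from dask.distributed import Client

# Assumes a scheduler started with a UCX address,
# for example: dask-scheduler --protocol ucx
client = Client("ucx://10.0.0.1:8786")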
We hope that this work will also benefit non-GPU users on HPC systems with InfiniBand, or even users on consumer hardware, thanks to the easy access to shared-memory communication.
In an earlier blogpost we discussed the challenges around installing the wrong versions of CUDA-enabled packages that don’t match the CUDA driver installed on the system. Fortunately, due to recent work from Stan Seibert and Michael Sarahan at Anaconda, Conda 4.7 now has a special cuda meta-package that is set to the version of the installed driver. This should make it much easier for users in the future to install the correct package.
Conda 4.7 was just released, and comes with many new features beyond the cuda meta-package. You can read more about it here.
conda update conda
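After updating, installing a CUDA-enabled package should pull builds that match the driver conda detects (an illustrative command; the package and channels here are just examples):

conda install -c rapidsai -c conda-forge cudf  # the solver matches cudatoolkit to your driver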
There is still plenty of work to do in the packaging space today. Everyone who builds conda packages does it their own way, resulting in headaches and heterogeneity. This is largely due to not having centralized infrastructure to build and test CUDA-enabled packages, like we have in Conda Forge. Fortunately, the Conda Forge community is working together with Anaconda and NVIDIA to help resolve this, though that will likely take some time.
This post gave an update on the status of some of the efforts behind GPU computing in Python. It also provided a variety of links for further reading. We include them below if you would like to learn more: