In recent years Python's array computing ecosystem has grown organically to support GPUs, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development.
However, to solidify this growth and apply it across the ecosystem we now need to do some central planning to move from a pair-wise model, where packages need to know about each other, to an ecosystem model, where packages can negotiate by developing and adhering to community-standard protocols.
With moderate effort we can define a subset of the Numpy API that works well across all of them, allowing the ecosystem to more smoothly transition between hardware. This post describes the opportunities and challenges of accomplishing this.
We start by discussing two kinds of libraries:
The Numpy array is one of the foundations of the numeric Python ecosystem, and serves as the standard model for similar libraries in other languages. Today it is used to analyze satellite and biomedical imagery, financial models, genomes, oceans and the atmosphere, super-computer simulations, and data from thousands of other domains.
However, Numpy was designed several years ago, and its implementation is no longer optimal for some modern hardware, particularly multi-core workstations, many-core GPUs, and distributed clusters.
Fortunately other libraries implement the Numpy array API on these other architectures:
So even when the Numpy implementation is no longer ideal, the Numpy API lives on in successor projects.
Note: the Numpy implementation remains ideal most of the time. Dense in-memory arrays are still the common case. This blogpost is about the minority of cases where Numpy is not ideal.
So today we can write similar code across Numpy, GPU, sparse, and parallel arrays:
import numpy as np
x = np.random.random(...) # Runs on a single CPU
y = x.T.dot(np.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5])
import cupy as cp
x = cp.random.random(...) # Runs on a GPU
y = x.T.dot(cp.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].get())
import dask.array as da
x = da.random.random(...) # Runs on many CPUs
y = x.T.dot(da.log(x) + 1)
z = y - y.mean(axis=0)
print(z[:5].compute())
...
Additionally, each of the deep learning frameworks (TensorFlow, PyTorch, MXNet) has a Numpy-like thing that is similar-ish to Numpy's API, but definitely not trying to be an exact match.
At the same time as the development of Numpy APIs for different hardware, many libraries today build algorithmic functionality on top of the Numpy API:
These projects and more enhance array computing in Python, adding new features beyond what Numpy itself provides.
There are also projects like Pandas, Scikit-Learn, and SciPy that use Numpy's in-memory internal representation. We're going to ignore these libraries for this blogpost and focus on those libraries that only use the high-level Numpy API and not the low-level representation.
Given the two groups of projects:
We want to use them together, applying Autograd to CuPy, TensorLy to Sparse, and so on, including all future implementations that might follow. This is challenging.
Unfortunately, while the array implementations' APIs are very similar to Numpy's API, they use different functions.
>>> numpy.sin is cupy.sin
False
This creates problems for the consumer libraries, because now they need to switch out which functions they use depending on which array-like objects they've been given.
def f(x):
    if isinstance(x, numpy.ndarray):
        return numpy.sin(x)
    elif isinstance(x, cupy.ndarray):
        return cupy.sin(x)
    elif ...
Today each array project implements a custom plugin system that they use to switch between some of the array options. Links to these plugin mechanisms are below if you're interested:
For example XArray can use either Numpy arrays or Dask arrays. This has been hugely beneficial to users of that project, who today seamlessly transition from small in-memory datasets on their laptops to 100TB datasets on clusters, all using the same programming model. However, when considering adding sparse or GPU arrays to XArray's plugin system, it quickly became clear that this would be expensive today.
Building, maintaining, and extending these plugin mechanisms is costly. The plugin systems in each project are not alike, so any new array implementation has to go to each library and build the same code several times. Similarly, any new algorithmic library must build plugins to every ndarray implementation. Each library has to explicitly import and understand each other library, and has to adapt as those libraries change over time. This coverage is not complete, and so users lack confidence that their applications are portable between hardware.
Pair-wise plugin mechanisms make sense for a single project,but are not an efficient choice for the full ecosystem.
I see two solutions today:
Each has challenges.
We can build a new library, here called arrayish, that holds dispatch-able versions of all of the relevant Numpy functions. We then convince everyone to use it instead of Numpy internally.
So in each array-like library’s codebase we write code like the following:
# inside numpy's codebase
import arrayish
import numpy

arrayish.sin.register(numpy.ndarray, numpy.sin)
arrayish.cos.register(numpy.ndarray, numpy.cos)
arrayish.dot.register(numpy.ndarray, numpy.ndarray, numpy.dot)
...
# inside cupy's codebase
import arrayish
import cupy

arrayish.sin.register(cupy.ndarray, cupy.sin)
arrayish.cos.register(cupy.ndarray, cupy.cos)
arrayish.dot.register(cupy.ndarray, cupy.ndarray, cupy.dot)
...
and so on for Dask, Sparse, and any other Numpy-like libraries.
In all of the algorithm libraries (like XArray, autograd, TensorLy, …) we use arrayish instead of Numpy:
# inside XArray's codebase
# import numpy
import arrayish as numpy
This is the same plugin solution as before, but now we build a community standard plugin system that hopefully all of the projects can agree to use.
This reduces the big n-by-m cost of maintaining several plugin systems to a more manageable n-plus-m cost of using a single plugin system in each library. This centralized project would also benefit, perhaps, from being better maintained than any individual project is likely to manage on its own.
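To make the idea concrete, here is a minimal sketch of what a dispatch-able function like arrayish.sin could look like internally. The Dispatcher class below is hypothetical and only for illustration; the actual arrayish prototype may be structured quite differently.

# A hypothetical, minimal single-argument dispatcher; not the actual arrayish implementation
import numpy

class Dispatcher:
    """Route a call to the implementation registered for the argument's type."""
    def __init__(self, name):
        self.name = name
        self.registry = {}  # maps array type -> implementation function

    def register(self, typ, implementation):
        self.registry[typ] = implementation

    def __call__(self, x, *args, **kwargs):
        for typ, implementation in self.registry.items():
            if isinstance(x, typ):
                return implementation(x, *args, **kwargs)
        raise TypeError("no implementation of %s for %r" % (self.name, type(x)))

sin = Dispatcher("sin")
sin.register(numpy.ndarray, numpy.sin)   # numpy would register its version
# cupy, dask, and sparse would register theirs in their own codebases

sin(numpy.ones(3))                       # dispatches to numpy.sin

With one such dispatcher per Numpy function, each array library registers its implementations once, and each algorithm library calls the dispatchers without needing to know which array type it was given.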
However this has costs:
Hameer Abbasi put together a rudimentary prototype for arrayish here: github.com/hameerabbasi/arrayish. There has been some discussion about this topic, using XArray+Sparse as an example, in pydata/sparse #1.
Alternatively, the central dispatching mechanism could live within Numpy itself.
Numpy functions could learn to hand control over to their arguments, allowing the array implementations to take over when possible. This would allow existing Numpy code to work on externally developed array implementations.
There is precedent for this. The __array_ufunc__ protocol allows any class that defines the __array_ufunc__ method to take control of any Numpy ufunc like np.sin or np.exp. Numpy reductions like np.sum already look for .sum methods on their arguments and defer to them if possible.
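As a rough illustration of how the protocol works, here is a toy wrapper class (not taken from any of the projects above) that intercepts Numpy ufuncs:

import numpy as np

class Wrapped:
    """Toy array wrapper that takes control of Numpy ufuncs via __array_ufunc__."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any Wrapped inputs, apply the ufunc, and re-wrap the result
        arrays = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(getattr(ufunc, method)(*arrays, **kwargs))

x = Wrapped([1.0, 2.0, 3.0])
y = np.sin(x)     # Numpy hands control to Wrapped.__array_ufunc__
type(y)           # Wrapped, not numpy.ndarray

Projects like Dask and Sparse implement the same hook, returning their own array types instead of a plain wrapper.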
Some array projects, like Dask and Sparse, already implement the __array_ufunc__ protocol. There is also an open PR for CuPy. Here is an example showing Numpy functions on Dask arrays cleanly.
>>> import numpy as np
>>> import dask.array as da
>>> x = da.ones(10, chunks=(5,)) # A Dask array
>>> np.sum(np.exp(x)) # Apply Numpy function to a Dask array
dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()> # get a Dask array
I recommend that all Numpy-API compatible array projects implement the __array_ufunc__ protocol.
This works for many functions, but not all. Other operations like tensordot, concatenate, and stack occur frequently in algorithmic code but are not covered here.
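This is because those operations are not ufuncs, so the __array_ufunc__ hook never fires for them; a quick check illustrates the distinction:

>>> import numpy as np
>>> isinstance(np.exp, np.ufunc)         # ufuncs are covered by __array_ufunc__
True
>>> isinstance(np.tensordot, np.ufunc)   # not a ufunc, so not covered
False
>>> isinstance(np.concatenate, np.ufunc)
False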
This solution avoids the community challenges of the arrayish solution above. Everyone is accustomed to aligning themselves to Numpy's decisions, and relatively little code would need to be rewritten.
The challenge with this approach is that historically Numpy has moved more slowly than the rest of the ecosystem. For example, the __array_ufunc__ protocol mentioned above was discussed for several years before it was merged. Fortunately Numpy has recently received funding to help it make changes like this more rapidly. The full-time developers hired under this funding have just started though, and it's not clear how much of a priority this work is for them at first.
For what it’s worth I’d prefer to see this Numpy protocol solution take hold.
In recent years Python's array computing ecosystem has grown organically to support GPUs, sparse, and distributed arrays. This is wonderful and a great example of the growth that can occur in decentralized open source development.
However, to solidify this growth and apply it across the ecosystem we now need to do some central planning to move from a pair-wise model, where packages need to know about each other, to an ecosystem model, where packages can negotiate by developing and adhering to community-standard protocols.
The community has done this kind of transition before (Numeric + Numarray -> Numpy, the Scikit-Learn fit/predict API, etc.), usually with surprisingly positive results.
The open questions I have today are the following:
After discussing this topic at the May NumPy Developer Sprint at BIDS, a few of us have drafted a Numpy Enhancement Proposal (NEP), available here.