Description
The goal of this issue is to discuss the design and prototype a way to register alternative implementations for core low level routines in scikit-learn, in particular to benefit from hardware optimized implementations (e.g. using GPUs efficiently).
Motivation
scikit-learn aims to provide reasonably easy to maintain and portable implementations of standard machine learning algorithms. Those implementations are typically written in Python (with the help of NumPy and SciPy) or in Cython when the overhead of the Python interpreter prevents us from efficiently implementing algorithms with (nested) tight loops. This allows us to ship reasonably fast implementations as binary packages (installable with pip/PyPI, conda/conda-forge, conda/anaconda or Linux distros) for a variety of platforms (Linux / macOS / Windows) x (x86_64, i686, arm64, ppc64le) from a single code-base with no external runtime dependencies beyond Python, NumPy and SciPy.
Recently, GPU hardware has proven very competitive for many machine learning workloads, either from a pure latency standpoint, or from the standpoint of a better computation/energy trade-off (irrespective of raw speed considerations). However, hardware-optimized implementations are typically not portable and mandate additional dependencies.
We therefore propose to design a way for our users to register alternative implementations of low-level computation routines in scikit-learn, provided they have installed the required extension package(s) that match their specific hardware.
Relationship to adopting the Array API spec
This proposal is related and complementary to another effort, namely:
The Array API spec makes it possible for some scikit-learn estimators, written using pure NumPy syntax, to delegate their computation to alternative Array API compatible libraries such as CuPy.
However, some algorithms in scikit-learn cannot be efficiently written using NumPy operations only. For instance, the main K-Means loop is written in Cython to process chunks of samples in parallel (using prange and OpenMP): it computes the distances to the centroids and reduces those distances on-the-fly to assign each sample to its closest centroid, while preventing unnecessary memory transfers between CPU cache and RAM.
If we want to run this algorithm efficiently on GPU hardware, one would need to dispatch the computation of this low-level function to an alternative implementation that can work on GPUs, either written in C/C++ with GPU-specific supporting runtime libraries and compilers (e.g. OpenCL, NVIDIA CUDA, Intel oneAPI DPC++, AMD ROCm...) or using a Python syntax with the help of the GPU support provided in numba, for instance.
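To make the discussion concrete, here is an illustrative pure-NumPy sketch (not scikit-learn code; the function name is hypothetical) of the chunked pairwise-distance + argmin reduction that the Cython K-Means loop performs. A hardware-specific extension would provide an optimized replacement for a routine of this shape:

```python
import numpy as np

def assign_to_centroids(X, centroids, chunk_size=256):
    """Assign each sample to its closest centroid, processing chunks of
    samples so the pairwise distance buffer stays small (cache-friendly).

    Illustrative sketch only: the real Cython implementation fuses these
    steps to avoid materializing full distance matrices.
    """
    n_samples = X.shape[0]
    labels = np.empty(n_samples, dtype=np.intp)
    # The ||c||^2 term is shared across all chunks of samples.
    c_sq = (centroids ** 2).sum(axis=1)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared Euclidean distance is ||x||^2 - 2 x.c + ||c||^2; the
        # ||x||^2 term is constant per row and does not affect the argmin.
        distances = c_sq - 2 * chunk @ centroids.T
        labels[start:start + chunk_size] = distances.argmin(axis=1)
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
centroids = rng.normal(size=(8, 5))
labels = assign_to_centroids(X, centroids)
```

Dispatching would replace the body of such a function, not the NumPy API calls inside it, which is why the Array API approach alone is not sufficient here.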
List of candidate routines
- the main k-means loop (pairwise distances between samples and centroids with on-the-fly argmin reduction)
- the core k-nearest neighbors computation loop (pairwise distances with on-the-fly arg-k-min reduction)
- pairwise distances (without arg-k-min reduction)
- pairwise kernels computation (e.g. for use in Nystroem)
- ...
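For illustration, the arg-k-min reduction mentioned for k-nearest neighbors can be expressed in pure NumPy as follows (hypothetical helper name; the real implementation fuses the distance computation and the reduction to avoid materializing the full distance matrix):

```python
import numpy as np

def arg_k_min(X, Y, k):
    """Indices of the k nearest rows of Y for each row of X, sorted by
    distance. Illustrative sketch of the pairwise-distances +
    arg-k-min reduction; not scikit-learn API.
    """
    # Full pairwise squared Euclidean distance matrix (n_X, n_Y).
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    # argpartition extracts the k smallest per row without a full sort...
    idx = np.argpartition(d, kth=k - 1, axis=1)[:, :k]
    # ...then only those k candidates are sorted by their distance.
    rows = np.arange(X.shape[0])[:, None]
    return idx[rows, np.argsort(d[rows, idx], axis=1)]
```

The K-Means routine above is the special case k=1 of this reduction.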
Explicit registration API design ideas
I started to draft some API ideas in:
Feel free to comment here, or there.
This design is expected to evolve, in particular to make it possible to register both Array API extensions and non-Array API extensions with the same registration API.
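As a starting point for discussion, one possible shape for such an explicit registration API is sketched below. All names here are hypothetical and do not exist in scikit-learn; they only illustrate the registry/dispatch idea:

```python
# Hypothetical sketch of an explicit registration API for low-level
# routines; none of these names are actual scikit-learn API.
_ENGINES = {}

def register_engine(routine, provider, impl):
    """Register an alternative implementation for a named routine."""
    _ENGINES.setdefault(routine, {})[provider] = impl

def get_engine(routine, provider=None, default=None):
    """Look up the implementation to dispatch to, falling back to the
    portable built-in implementation when no extension is registered."""
    impls = _ENGINES.get(routine, {})
    if provider is not None:
        return impls[provider]
    # Pick any registered extension, else the built-in default.
    return next(iter(impls.values()), default)

# An extension package (e.g. a GPU plugin) would call, at import time:
#   register_engine("kmeans_assignment", "my_gpu_plugin", gpu_assign)
# and scikit-learn internals would dispatch with:
#   engine = get_engine("kmeans_assignment", default=cython_assign)
```

An open question for this design is whether dispatch should be implicit (any registered extension wins) or require an explicit opt-in such as a config option or context manager naming the provider.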
Next steps
- start a proof-of-concept implementation in a branch + external repo, using either numba or C++, for k-means. Edit: this is happening here:
- once the API design has converged on the PoC, write a formal SLEP.