Sparse
This implements sparse arrays of arbitrary dimension on top of numpy and scipy.sparse.
It generalizes the scipy.sparse.coo_matrix and scipy.sparse.dok_matrix layouts, but extends beyond just rows and columns to an arbitrary number of dimensions.
Additionally, this project maintains compatibility with the numpy.ndarray interface rather than the numpy.matrix interface used in scipy.sparse.
These differences make this project useful in certain situations where scipy.sparse matrices are not well suited, but it should not be considered a full replacement. The data structures in pydata/sparse complement and can be used in conjunction with the fast linear algebra routines inside scipy.sparse. A format conversion or copy may be required.
Motivation
Sparse arrays, or arrays that are mostly empty or filled with zeros, are common in many scientific applications. To save space we often avoid storing these arrays in traditional dense formats, and instead choose different data structures. Our choice of data structure can significantly affect our storage and computational costs when working with these arrays.
Design
The main data structure in this library follows the Coordinate List (COO) layout for sparse matrices, but extends it to multiple dimensions.
The COO layout stores the row index, column index, and value of every stored (typically nonzero) element:
| row | col | data |
|---|---|---|
| 0 | 0 | 10 |
| 0 | 2 | 13 |
| 1 | 3 | 9 |
| 3 | 8 | 21 |
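As a minimal sketch of how the three COO columns describe a matrix, the triplets from the table above can be scattered into a dense array with plain numpy. The shape is an assumption (the smallest one that fits the largest indices), and the variable names are illustrative only; this is not how the library stores or densifies its arrays internally.

```python
import numpy as np

# COO triplets copied from the table above.
row = np.array([0, 0, 1, 3])
col = np.array([0, 2, 3, 8])
data = np.array([10, 13, 9, 21])

# Assumed shape: the smallest one that contains every index.
shape = (int(row.max()) + 1, int(col.max()) + 1)

# Scatter the stored values into an array of zeros.
dense = np.zeros(shape, dtype=data.dtype)
dense[row, col] = data
```

Only four values and eight indices are stored, while the dense equivalent holds 36 entries, which is the space saving the layout is after.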
It is straightforward to extend the COO layout to an arbitrary number of dimensions:
| dim1 | dim2 | dim3 | … | data |
|---|---|---|---|---|
| 0 | 0 | 0 | . | 10 |
| 0 | 0 | 3 | . | 13 |
| 0 | 2 | 2 | . | 9 |
| 3 | 1 | 4 | . | 21 |
This makes it easy to store a multidimensional sparse array, but we still need to reimplement all of the array operations like transpose, reshape, slicing, tensordot, reductions, etc., which can be challenging in general.
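To illustrate why operations have to be reimplemented on the coordinate representation, here is a hedged sketch of a three-dimensional COO array and a transpose that only reorders the coordinate rows. It uses plain numpy; the helper names `to_dense` and `transpose`, the choice of three dimensions, and the shape are assumptions made for this example, not the library's API.

```python
import numpy as np

# One row per dimension, one column per stored element
# (the first three index columns of the table above).
coords = np.array([
    [0, 0, 0, 3],  # dim1
    [0, 0, 2, 1],  # dim2
    [0, 3, 2, 4],  # dim3
])
data = np.array([10, 13, 9, 21])
shape = (4, 3, 5)  # assumed; any shape containing the indices works


def to_dense(coords, data, shape):
    """Scatter the stored values into a dense array of zeros."""
    out = np.zeros(shape, dtype=data.dtype)
    out[tuple(coords)] = data
    return out


def transpose(coords, shape, axes):
    """Permute axes by reordering coordinate rows; data is untouched."""
    axes = list(axes)
    return coords[axes], tuple(shape[a] for a in axes)


axes = (2, 0, 1)
t_coords, t_shape = transpose(coords, shape, axes)
```

Transpose is one of the easy cases because the values never move. Operations such as reshape or slicing need index arithmetic or filtering on the coordinates, which is where the difficulty mentioned above comes from.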
This library also includes several other data structures. Similar to COO, the Dictionary of Keys (DOK) format for sparse matrices generalizes well to an arbitrary number of dimensions. DOK is well-suited for writing and mutating. Most other operations are not supported for DOK. A common workflow may involve writing an array with DOK and then converting to another format for other operations.
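The write-then-convert workflow can be sketched with an ordinary Python dictionary keyed by index tuples. This is an illustration of the DOK idea only, under the assumption of a hypothetical 4 x 3 x 5 array; it does not use or reproduce the library's own DOK class.

```python
import numpy as np

shape = (4, 3, 5)  # hypothetical array shape

# Dictionary of Keys: index tuple -> value. Writes and updates are cheap.
dok = {}
dok[(0, 0, 3)] = 13
dok[(3, 1, 4)] = 21
dok[(0, 0, 3)] = 7    # overwrite an existing entry
dok[(0, 2, 2)] = 9
del dok[(0, 2, 2)]    # remove an entry again

# Convert to a COO-style pair of arrays once the writing phase is done.
keys = sorted(dok)
coords = np.array(keys).T                 # shape (ndim, nnz)
data = np.array([dok[k] for k in keys])   # shape (nnz,)
```

The dictionary makes single-element mutation simple, while the array pair produced at the end is the form better suited to vectorized operations.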
The Compressed Sparse Row/Column (CSR/CSC) formats are widely used in scientific computing and are now supported by pydata/sparse. The CSR/CSC formats excel at compression and mathematical operations. While these formats are restricted to two dimensions, pydata/sparse also supports the GCXS sparse array format, based on GCRS/GCCS, which generalizes CSR/CSC to n-dimensional arrays. Like their two-dimensional CSR/CSC counterparts, GCXS arrays compress well. Whereas the storage cost of COO depends heavily on the number of dimensions of the array, the number of dimensions only minimally affects the storage cost of GCXS arrays, which results in favorable compression ratios across many use cases.
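The difference in how storage scales with dimensionality can be made concrete with a rough count of stored index entries. This is a simplified model, not a measurement of the library: it assumes a compressed format behaves like CSR after the array is collapsed to two dimensions (one index per stored element plus a pointer array), and it ignores the values themselves and index dtypes. The function names and the numbers are hypothetical.

```python
def coo_index_count(ndim, nnz):
    """COO keeps one coordinate per dimension for every stored element."""
    return ndim * nnz


def compressed_index_count(compressed_len, nnz):
    """CSR-style model: one index per stored element plus a pointer
    array with one entry per compressed row, plus one."""
    return nnz + compressed_len + 1


nnz = 1_000_000          # assumed number of stored elements
compressed_len = 1_000   # assumed length of the compressed axis

# Map each dimensionality to (COO index entries, compressed index entries).
counts = {
    ndim: (coo_index_count(ndim, nnz),
           compressed_index_count(compressed_len, nnz))
    for ndim in (2, 3, 4, 5)
}
```

Under these assumptions the COO count grows linearly with the number of dimensions, while the compressed count stays fixed for a given number of stored elements and compressed-axis length.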
Together these formats cover a wide array of applications of sparsity.
Additionally, with each format complying with the numpy.ndarray interface and following the appropriate dispatching protocols, pydata/sparse arrays can interact with other array libraries and seamlessly take part in pydata-ecosystem-based workflows.
LICENSE
This library is licensed under BSD-3.