Computing/storing a `MultiscaleSpatialImage` with dask array data

Hi, I recently discovered `spatial-image` and `multiscale-spatial-image` and find it super useful to have an xarray based representation of NGFF datasets (and multiscale images for that matter!).

Exploring how to best use this approach, I came across the following observation. Consider the following example:

```python
import spatial_image as si
import multiscale_spatial_image as msi
import dask.array as da
from dask.diagnostics import ProgressBar

sim = si.to_spatial_image(da.zeros((10,10,10)))

msim = msi.to_multiscale(sim, scale_factors=[1, 2])
print(msim.groups)

with ProgressBar():
    # msim.to_zarr('file.zarr')
    msim.compute()
```
> ('/', '/scale0', '/scale1', '/scale2')
[########################################] | 100% Completed | 104.17 ms
[########################################] | 100% Completed | 105.15 ms
[########################################] | 100% Completed | 115.48 ms

When computing (or writing) a `MultiscaleSpatialImage`, the scale datasets are computed sequentially and independently, as indicated in the above output by the presence of three separate progress bars. And indeed if I understand correctly, xarray datatree implements the `compute` method by [calling it independently on each tree node](https://github.com/xarray-contrib/datatree/blob/696cec9e6288ba9e8c473cd1ba527122edef2b1c/datatree/ops.py#L26). This means that although the dask arrays of higher scales depend on the lowest scale, each scale is recomputed from scratch each time (caching is probably not relevant in the context of large enough data).

In my current understanding this leads to redundant computations, especially relevant for large/expensive underlying dask graphs/computations. It seems more desirable to obtain the downsampled data directly from the full resolution data, as chunks become computed.

Is there currently a good way to obtain this behaviour for `MultiscaleSpatialImage`s?

I guess in principle one would want to bundle the computation of all tree nodes into a single call to dask.compute, so that the scheduler can take care of optimizations. A reasonable alternative to the above exemplified workflow would probably be to stream a `SpatialImage` to file first, and then stream this into a `MultiscaleSpatialImage`. However, this of course involves an additional i/o step.

Please excuse the lengthy issue and thanks a lot for any feedback / discussion :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Computing/storing a `MultiscaleSpatialImage` with dask array data #78

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Computing/storing a MultiscaleSpatialImage with dask array data #78

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Computing/storing a `MultiscaleSpatialImage` with dask array data #78