Description
Code Sample, a copy-pastable example if possible
I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem.
The easiest way to summarize the problem is with an example. Consider the following dataset:

```python
import xarray as xr

ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])},
                coords={'x': (['x'], [0, 1, 2, 3]),
                        'bar': (['x'], ['a', 'a', 'b', 'b']),
                        'baz': (['x'], ['a', 'b', 'a', 'b'])})
ds = ds.chunk({'x': 2})
ds
```

```
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
Data variables:
    foo      (x) int64 dask.array<shape=(4,), chunksize=(2,)>
```
One non-dimension coordinate (`bar`) is contiguous with respect to `x`, while the other (`baz`) is not. This is important: `baz` is structured similarly to the way `month` would be distributed on a timeseries dataset.
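The difference between the two labelings can be made concrete with plain pandas: `GroupBy.indices` maps each label to the integer positions it selects, which are contiguous runs for `bar` but strided selections for `baz`.

```python
import pandas as pd

bar = pd.Series(['a', 'a', 'b', 'b'])  # contiguous runs of labels
baz = pd.Series(['a', 'b', 'a', 'b'])  # interleaved labels

# Each label maps to the integer positions it selects.
print(bar.groupby(bar).indices)  # 'a' -> [0, 1], 'b' -> [2, 3]: contiguous slices
print(baz.groupby(baz).indices)  # 'a' -> [0, 2], 'b' -> [1, 3]: strided selections
```

Selecting a contiguous run from a chunked array can reuse existing chunks, whereas a strided selection requires gathering elements across chunk boundaries.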
Now let's do a trivial groupby operation on `bar` that does nothing, just returning each group unchanged:
```python
ds.foo.groupby('bar').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(2,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```
This operation preserved the original chunks of `foo`. But if we group by `baz`, we see something different:
```python
ds.foo.groupby('baz').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(4,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```
Problem description
When grouping over a non-contiguous variable (`baz`), the result is collapsed into a single chunk spanning the whole array. That means we can't lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like:
```python
ds = xr.open_mfdataset('lots/of/files/*.nc')
ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
```

It is currently impossible to do this lazily due to the issue described above.
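For reference, the same anomaly computation can also be spelled with xarray's groupby arithmetic, where subtracting a grouped reduction broadcasts it back over the groups. This is a sketch on a small synthetic in-memory series (the names `clim` and `ds_anom` are illustrative); it exercises the same group-indexing machinery, so it does not by itself sidestep the chunking problem, but it shows the computation being targeted:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of monthly data: values 0..23 along a time axis.
time = pd.date_range('2000-01-01', periods=24, freq='MS')
ds = xr.Dataset({'foo': ('time', np.arange(24.0))}, coords={'time': time})

# Monthly climatology, then anomalies via groupby arithmetic.
clim = ds.groupby('time.month').mean('time')
ds_anom = ds.groupby('time.month') - clim

# Each month appears twice (values m-1 and m+11), so every sample sits
# 6 below or 6 above its monthly mean.
print(ds_anom.foo.values)
```

With only two samples per month, every anomaly comes out as -6 (first year) or +6 (second year), which makes the result easy to verify by hand.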
Expected Output
We would like the original chunk structure of `foo` to be preserved.
Output of xr.show_versions()
`xr.show_versions()` is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.