Description
Code Sample, a copy-pastable example if possible
I am continuing my quest to obtain more efficient time grouping for calculation of climatologies and climatological anomalies. I believe this is one of the major performance bottlenecks facing xarray users today. I have raised this in other issues (e.g. #1832), but I believe I have narrowed it down here to a more specific problem.
The easiest way to summarize the problem is with an example. Consider the following dataset:

```python
import xarray as xr

ds = xr.Dataset({'foo': (['x'], [1, 1, 1, 1])},
                coords={'x': (['x'], [0, 1, 2, 3]),
                        'bar': (['x'], ['a', 'a', 'b', 'b']),
                        'baz': (['x'], ['a', 'b', 'a', 'b'])})
ds = ds.chunk({'x': 2})
ds
```

```
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
Data variables:
    foo      (x) int64 dask.array<shape=(4,), chunksize=(2,)>
```
One non-dimension coordinate (`bar`) is contiguous with respect to `x`, while the other (`baz`) is not. This is important: `baz` is structured similarly to the way `month` would be distributed on a timeseries dataset.
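The difference between the two labelings can be made concrete with plain pandas: `GroupBy.indices` maps each label to the integer positions it selects, which are contiguous runs for `bar` but strided selections for `baz`.

```python
import pandas as pd

bar = pd.Series(['a', 'a', 'b', 'b'])  # contiguous runs of labels
baz = pd.Series(['a', 'b', 'a', 'b'])  # interleaved labels

# Each label maps to the integer positions it selects.
print(bar.groupby(bar).indices)  # 'a' -> [0, 1], 'b' -> [2, 3]: contiguous slices
print(baz.groupby(baz).indices)  # 'a' -> [0, 2], 'b' -> [1, 3]: strided selections
```

Selecting a contiguous run from a chunked array can reuse existing chunks, whereas a strided selection requires gathering elements across chunk boundaries.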
Now let's do a trivial groupby operation on `bar` that does nothing, just returning each group unchanged:
```python
ds.foo.groupby('bar').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(2,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```
This operation preserved the original chunks of `foo`. But if we group by `baz`, we see something different:
```python
ds.foo.groupby('baz').apply(lambda x: x)
```

```
<xarray.DataArray 'foo' (x: 4)>
dask.array<shape=(4,), dtype=int64, chunksize=(4,)>
Coordinates:
  * x        (x) int64 0 1 2 3
    bar      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
    baz      (x) <U1 dask.array<shape=(4,), chunksize=(2,)>
```
Problem description
When grouping over a non-contiguous variable (`baz`), the result is collapsed into a single chunk spanning the whole array. That means we can't lazily access a single item without computing the whole array. This has major performance consequences that make it hard to calculate anomaly values in a more realistic case. What we really want to do is often something like:
```python
ds = xr.open_mfdataset('lots/of/files/*.nc')
ds_anom = ds.groupby('time.month').apply(lambda x: x - x.mean(dim='time'))
```

It is currently impossible to do this lazily due to the issue described above.
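For reference, the same anomaly computation can also be spelled with xarray's groupby arithmetic, where subtracting a grouped reduction broadcasts it back over the groups. This is a sketch on a small synthetic in-memory series (the names `clim` and `ds_anom` are illustrative); it exercises the same group-indexing machinery, so it does not by itself sidestep the chunking problem, but it shows the computation being targeted:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of monthly data: values 0..23 along a time axis.
time = pd.date_range('2000-01-01', periods=24, freq='MS')
ds = xr.Dataset({'foo': ('time', np.arange(24.0))}, coords={'time': time})

# Monthly climatology, then anomalies via groupby arithmetic.
clim = ds.groupby('time.month').mean('time')
ds_anom = ds.groupby('time.month') - clim

# Each month appears twice (values m-1 and m+11), so every sample sits
# 6 below or 6 above its monthly mean.
print(ds_anom.foo.values)
```

With only two samples per month, every anomaly comes out as -6 (first year) or +6 (second year), which makes the result easy to verify by hand.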
Expected Output
We would like the original chunk structure of `foo` to be preserved.
Output of xr.show_versions()
`xr.show_versions()` is triggering a segfault right now on my system for unknown reasons! I am using xarray 0.10.7.