Following up on today's earlier discussion with @mrocklin, here is an example of a calculation that is currently too "big" for the pangeo.pydata.org cluster to handle.
It's a pretty simple case:
```python
import xarray as xr
import gcsfs

# enable xarray's new cftime index for non-standard dates
xr.set_options(enable_cftimeindex=True)

# open the dataset (11.3 TB in 292000 chunks)
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/cm2.6/control/temp_salt_u_v-5day_avg/')
ds = xr.open_zarr(gcsmap)

# calculate the squares of all the data variables
ds_squared = ds**2

# and put them back into the original dataset (doubling the size and number of tasks)
for var in ds_squared.data_vars:
    ds[var + '_squared'] = ds_squared[var]

# calculate the monthly-mean climatology and persist for further analysis / visualization
ds_mm_clim = ds.groupby('time.month').mean(dim='time')

# 186 GB in 4800 chunks
ds_mm_clim.persist()
```

I ran this on a cluster with 100 workers (~600 GB of total memory). The computation reached the scheduler and showed up on the dashboard after ~10 minutes; the graph contained over a million tasks. Workers started to crash, and then the notebook crashed.
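For a sense of scale, a common rule of thumb is that each task costs the scheduler on the order of a kilobyte of state, so a million-task graph is already around a gigabyte of scheduler memory before any data moves. The per-task figure below is an order-of-magnitude assumption, not a measured number:

```python
# Back-of-the-envelope scheduler overhead for the graph above.
n_tasks = 1_000_000      # roughly what the dashboard reported
bytes_per_task = 1_000   # ~1 kB of scheduler state per task (assumed, not measured)
overhead_gb = n_tasks * bytes_per_task / 1e9
print(f"~{overhead_gb:.0f} GB of scheduler memory")
```

That rough figure is comparable to the memory available to a single worker here, which is consistent with the graph itself being a large part of the problem.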
Some thoughts:
- The dask graph itself must be huge. The workers only have 6 GB of memory each, and the notebook has 14 GB, so everything is running out of memory. Do we just need much more memory? That could be accomplished by changing the image types; we could get orders of magnitude more memory if that's what it takes.
- The graph must be very repetitive. Is there an opportunity for some sort of optimization / compression to reduce the size of the graph?
- Do we really need to send the whole calculation through at once? Or is it possible to "stream" this calculation somehow?
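One possible version of the "streaming" idea is to submit the calculation one variable at a time, so the scheduler only ever holds the graph for a single variable. The sketch below demonstrates the pattern on a tiny synthetic dataset (the variable names and sizes are made up for illustration; on the real data each loop iteration would be a separate `persist`/`compute` against the zarr store):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny stand-in for the real dataset: two variables on a 24-step monthly axis.
time = pd.date_range('2000-01-01', periods=24, freq='MS')
ds = xr.Dataset(
    {name: (('time',), np.random.rand(24)) for name in ['temp', 'salt']},
    coords={'time': time},
)

# Process one variable at a time instead of building one giant graph:
# square it, take the monthly climatology, and collect the results.
results = {}
for name in ds.data_vars:
    squared = ds[name] ** 2
    results[name + '_squared'] = squared.groupby('time.month').mean(dim='time')

ds_mm_clim = xr.Dataset(results)
print(list(ds_mm_clim.data_vars))
```

This doesn't shrink the total work, but it caps how much graph is in flight at any moment, at the cost of losing whatever scheduling overlap a single combined graph would allow.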