flox: fast & furious GroupBy reductions for dask.array
¶
Overview¶
flox
mainly provides strategies for fast GroupBy reductions with dask.array. flox
uses the MapReduce paradigm (or a “tree reduction”)
to run the GroupBy operation in a parallel-native way totally avoiding a sort or shuffle operation. It was motivated by
See a presentation (video, slides) about this package, from the Pangeo Showcase.
Why flox?¶
flox.groupby_reduce()
wraps thenumpy-groupies
package for performant Groupby reductions on nD arrays.flox.groupby_reduce()
provides parallel-friendly strategies for GroupBy reductions by wrappingnumpy-groupies
for dask arrays.flox
integrates with xarray to provide more performant Groupby and Resampling operations.flox.xarray.xarray_reduce()
extends Xarray’s GroupBy operations allowing lazy grouping by dask arrays, grouping by multiple arrays, as well as combining categorical grouping and histogram-style binning operations using multiple variables.flox
also provides utility functions for rechunking both dask arrays and Xarray objects along a single dimension using the group labels as a guide:To rechunk for blockwise operations:
flox.rechunk_for_blockwise()
,flox.xarray.rechunk_for_blockwise()
.To rechunk so that “cohorts”, or groups of labels, tend to occur in the same chunks:
flox.rechunk_for_cohorts()
,flox.xarray.rechunk_for_cohorts()
.
Installing¶
$ pip install flox
$ conda install -c conda-forge flox
Acknowledgements¶
This work was funded in part by
NASA-ACCESS 80NSSC18M0156 “Community tools for analysis of NASA Earth Observing System Data in the Cloud” (PI J. Hamman),
NASA-OSTFL 80NSSC22K0345 “Enhancing analysis of NASA data with the open-source Python Xarray Library” (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
It was motivated by many discussions in the Pangeo community.