flox: fast & furious GroupBy reductions for dask.array

Overview

flox primarily provides strategies for fast GroupBy reductions with dask.array. It uses the MapReduce paradigm (also called a “tree reduction”) to run the GroupBy operation in a parallel-native way, avoiding sort and shuffle operations entirely. It was motivated by:

  1. Dask Dataframe GroupBy blogpost

  2. numpy_groupies in Xarray issue

See a presentation (video, slides) about this package, from the Pangeo Showcase.
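The MapReduce idea described above can be sketched in plain Python. This is only an illustration of the general technique, not flox's implementation: the helper names `chunk_reduce`, `combine`, and `tree_reduce` are hypothetical, and lists and dicts stand in for dask chunks and their intermediate results.

```python
from collections import defaultdict
from functools import reduce


def chunk_reduce(values, labels):
    """Map step: compute per-group partial sums within a single chunk."""
    partial = defaultdict(float)
    for value, label in zip(values, labels):
        partial[label] += value
    return dict(partial)


def combine(a, b):
    """Reduce step: merge two dicts of per-group partial sums."""
    merged = dict(a)
    for label, subtotal in b.items():
        merged[label] = merged.get(label, 0.0) + subtotal
    return merged


def tree_reduce(partials):
    """Combine intermediates pairwise, level by level, until one remains."""
    while len(partials) > 1:
        partials = [
            reduce(combine, partials[i : i + 2])
            for i in range(0, len(partials), 2)
        ]
    return partials[0]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["a", "b", "a", "b", "a", "c"]

# Split the data into three "chunks" of two elements each.
chunks = [(values[i : i + 2], labels[i : i + 2]) for i in range(0, 6, 2)]

# Each chunk is reduced independently; no sorting or shuffling of the data.
partials = [chunk_reduce(v, l) for v, l in chunks]
result = tree_reduce(partials)
print(result)
```

Each chunk only ever sees its own values and labels, so the map step can run in parallel. Only the small per-group intermediates move between workers during the combine step.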

Why flox?

  1. flox.groupby_reduce() wraps the numpy-groupies package for performant GroupBy reductions on nD arrays.

  2. For dask arrays, flox.groupby_reduce() builds on that numpy-groupies wrapper to provide parallel-friendly strategies for GroupBy reductions.

  3. flox integrates with xarray to provide more performant GroupBy and resampling operations.

  4. flox.xarray.xarray_reduce() extends Xarray’s GroupBy operations, allowing lazy grouping by dask arrays, grouping by multiple arrays, and combining categorical grouping with histogram-style binning across multiple variables.

  5. flox also provides utility functions for rechunking both dask arrays and Xarray objects along a single dimension using the group labels as a guide:

    1. To rechunk for blockwise operations: flox.rechunk_for_blockwise(), flox.xarray.rechunk_for_blockwise().

    2. To rechunk so that “cohorts”, or groups of labels, tend to occur in the same chunks: flox.rechunk_for_cohorts(), flox.xarray.rechunk_for_cohorts().
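To illustrate what “using the group labels as a guide” can mean for blockwise rechunking, here is a minimal sketch that picks chunk sizes whose boundaries fall only where the label changes. It is an assumption-laden illustration, not the algorithm used by flox.rechunk_for_blockwise(): the function `blockwise_chunk_sizes` and its `target_size` parameter are hypothetical, and the labels are assumed to be sorted so that each group forms one contiguous run.

```python
def blockwise_chunk_sizes(labels, target_size):
    """Return chunk sizes whose boundaries fall only where the label changes.

    Hypothetical helper: assumes each label occupies one contiguous run.
    A run longer than target_size is kept whole rather than split.
    """
    # Collapse the labels into [label, run_length] pairs.
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])

    # Greedily pack whole runs into chunks of at most target_size elements.
    sizes = []
    current = 0
    for _, length in runs:
        if current and current + length > target_size:
            sizes.append(current)
            current = 0
        current += length
    if current:
        sizes.append(current)
    return sizes


labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
sizes = blockwise_chunk_sizes(labels, target_size=4)
print(sizes)
```

With chunks laid out this way, every group lives entirely inside one chunk, so each chunk can be reduced on its own without combining intermediates across chunks.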

Installing

$ pip install flox
$ conda install -c conda-forge flox

Acknowledgements

This work was funded in part by

  1. NASA-ACCESS 80NSSC18M0156 “Community tools for analysis of NASA Earth Observing System Data in the Cloud” (PI J. Hamman),

  2. NASA-OSTFL 80NSSC22K0345 “Enhancing analysis of NASA data with the open-source Python Xarray Library” (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and

  3. NCAR’s Earth System Data Science Initiative.

It was motivated by many discussions in the Pangeo community.

Contents