Jun 20, 2019

Load Large Image Data with Dask Array

Executive Summary

This post explores simple workflows to load large stacks of image data with Dask array.

In particular, we start with a directory full of TIFF files of images like the following:

$ ls raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif

and show how to stitch these together into large lazy arrays using the dask-image library

>>> import dask_image.imread
>>> x = dask_image.imread.imread('raw/*.tif')

or by writing your own Dask delayed image reader function.

        Array               Chunk
Bytes   3.16 GB             316.15 MB
Shape   (2010, 1024, 768)   (201, 1024, 768)
Count   30 Tasks            10 Chunks
Type    uint16              numpy.ndarray

Eventually we'll be able to perform complex calculations on this Dask array.

Light Microscopy data rendered with NVidia IndeX

Disclaimer: we're not going to produce rendered images like the above in this post. These were created with NVidia IndeX, a completely separate tool chain from what is being discussed here. This post covers the first step of image loading.

Series Overview

A common case in fields that acquire large amounts of imaging data is to write out smaller acquisitions into many small files. These files can tile a larger space, sub-sample from a larger time period, and may contain multiple channels. The acquisition techniques themselves are often state of the art, constantly pushing the envelope in terms of how large a field of view can be acquired, at what resolution, and at what quality.

Once acquired, this data presents a number of challenges. Algorithms that were designed and tested to work on very small pieces of this data need to be scaled up to work on the full dataset. It might not be clear at the outset what will actually work, and so exploration still plays a very big part in the whole process.

Historically this analytical process has involved a lot of custom code. Often the analytical process is stitched together by a series of scripts, possibly in several different languages, that write various intermediate results to disk. Thanks to advances in modern tooling, these processes can be significantly improved. In this series of blogposts, we will outline ways for image scientists to leverage different tools to move towards a high-level, friendly, cohesive, interactive analytical pipeline.

Post Overview

This post in particular focuses on loading and managing large stacks of image data in parallel from Python.

Loading large image data can be a complex and often unique problem. Different groups may choose to store this across many files on disk, in a commodity or custom database solution, or they may opt to store it in the cloud. Not all datasets within the same group may be treated the same for a variety of reasons. In short, this means loading data is a hard and expensive problem.

Despite data being stored in many different ways, groups often want to reapply the same analytical pipeline to these datasets. However, if the data pipeline is tightly coupled to a particular way of loading the data for later analytical steps, it may be very difficult if not impossible to reuse an existing pipeline. In other words, there is friction between the loading and analysis steps, which frustrates efforts to make things reusable.

Having a modular and general way to load data makes it easy to present data stored differently in a standard way. Further, having a standard way to present data to analytical pipelines allows that part of the pipeline to focus on what it does best: analysis! In general, this should decouple these two components in a way that improves the experience of users involved in all parts of the pipeline.

We will use image data generously provided by Gokul Upadhyayula at the Advanced Bioimaging Center at UC Berkeley and discussed in this paper (preprint), though the workloads presented here should work for any kind of imaging data, or array data generally.

Load image data with Dask

Let’s start again with our image data from the top of the post:

$ ls /path/to/files/raw/ | head
ex6-2_CamA_ch1_CAM1_stack0000_560nm_0000000msec_0001291795msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0001_560nm_0043748msec_0001335543msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0002_560nm_0087497msec_0001379292msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0003_560nm_0131245msec_0001423040msecAbs_000x_000y_000z_0000t.tif
ex6-2_CamA_ch1_CAM1_stack0004_560nm_0174993msec_0001466788msecAbs_000x_000y_000z_0000t.tif

Load a single sample image with Scikit-Image

To load a single image, we use imageio:

>>> import glob
>>> filenames = glob.glob("/path/to/files/raw/*.tif")
>>> len(filenames)
597

>>> import imageio
>>> sample = imageio.imread(filenames[0])
>>> sample.shape
(201, 1024, 768)

Each filename corresponds to some 3d chunk of a larger image. We can look at a few 2d slices of this single 3d chunk to get some context.

import matplotlib.pyplot as plt
import skimage.io
plt.figure(figsize=(10, 10))
skimage.io.imshow(sample[:, :, 0])

plt.figure(figsize=(10, 10))
skimage.io.imshow(sample[:, 0, :])

plt.figure(figsize=(10, 10))
skimage.io.imshow(sample[0, :, :])

Investigate Filename Structure

These are slices from only one chunk of a much larger aggregate image. Our interest here is combining the pieces into a large image stack. It is common to see a naming structure like the following in the filenames:

mydata_ch<i>_<j>t_<k>x_<l>y_<m>z.tif

Each filename may then indicate a channel, time step, and spatial location, with the <i> being some numeric values (possibly with units). Individual filenames may have more or less information and may notate it differently than we have here.
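As a small, hypothetical sketch of this kind of parsing (the helper name below is made up, and the real filenames at the top of the post encode more fields than this), the channel and stack indices could be pulled out with ordinary string operations; the full example later in the post uses the same idea inline.

import os

def parse_channel_and_stack(fn):
    # Hypothetical helper: extract channel and stack indices from names
    # like "ex6-2_CamA_ch1_CAM1_stack0003_...tif"
    base = os.path.basename(fn)
    channel = int(base.split("_ch")[1].split("_")[0])
    stack = int(base.split("_stack")[1].split("_")[0])
    return channel, stack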

In principle with NumPy we might allocate a giant array and then iteratively load images and place them into the giant array.

import numpy as np

full_array = np.empty((..., ..., ..., ..., ...), dtype=sample.dtype)

for fn in filenames:
    img = imageio.imread(fn)
    index = get_location_from_filename(fn)  # We need to write this function
    full_array[index, :, :, :] = img

However, if our data is large then we can't load it all into memory at once into a single NumPy array like this; instead we need to be a bit more clever to handle it efficiently. One approach here is to use Dask, which handles larger-than-memory workloads easily.

Lazily load images with Dask Array

Now we learn how to lazily load and stitch together image data with Dask array. We'll start with simple examples first and then move on to the full example with this more complex dataset afterwards.

We can delay the imageio.imread calls with Dask Delayed.

import dask
import dask.array as da

lazy_arrays = [dask.delayed(imageio.imread)(fn) for fn in filenames]
lazy_arrays = [da.from_delayed(x, shape=sample.shape, dtype=sample.dtype)
               for x in lazy_arrays]

Note: here we're assuming that all of the images have the same shape and dtype as the sample file that we loaded above. This is not always the case. See the dask_image note below in the Future Work section for an alternative.
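If you are not sure whether that assumption holds for your data, one cheap sanity check (just a sketch, not part of the workflow above) is to eagerly read a few randomly chosen files and compare them to the sample:

import random

# Spot-check a handful of files; reading every file eagerly would defeat
# the purpose of lazy loading
for fn in random.sample(filenames, 5):
    img = imageio.imread(fn)
    assert img.shape == sample.shape and img.dtype == sample.dtype, fn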

We haven't yet stitched these together. We have hundreds of single-chunk Dask arrays, each of which lazily loads a single 3d chunk of data from disk. Let's look at a single array.

>>> lazy_arrays[0]

        Array              Chunk
Bytes   316.15 MB          316.15 MB
Shape   (201, 1024, 768)   (201, 1024, 768)
Count   2 Tasks            1 Chunks
Type    uint16             numpy.ndarray

This is a lazy 3-dimensional Dask array of a single 300 MB chunk of data. That chunk is created by loading in a particular TIFF file. Normally Dask arrays are composed of many chunks. We can concatenate many of these single-chunked Dask arrays into a multi-chunked Dask array with functions like da.concatenate and da.stack.

Here we concatenate the first ten Dask arrays along a few axes, to get an easier-to-understand picture of how this looks. Take a look at how the shape and chunk shape change in the tables below as we change the axis= parameter.

da.concatenate(lazy_arrays[:10], axis=0)

        Array               Chunk
Bytes   3.16 GB             316.15 MB
Shape   (2010, 1024, 768)   (201, 1024, 768)
Count   30 Tasks            10 Chunks
Type    uint16              numpy.ndarray

da.concatenate(lazy_arrays[:10], axis=1)

        Array               Chunk
Bytes   3.16 GB             316.15 MB
Shape   (201, 10240, 768)   (201, 1024, 768)
Count   30 Tasks            10 Chunks
Type    uint16              numpy.ndarray

da.concatenate(lazy_arrays[:10], axis=2)

        Array               Chunk
Bytes   3.16 GB             316.15 MB
Shape   (201, 1024, 7680)   (201, 1024, 768)
Count   30 Tasks            10 Chunks
Type    uint16              numpy.ndarray

Or, if we wanted to make a new dimension, we would use da.stack. In this case we've run out of easily visible dimensions, so take note of the listed shape in the table below. Notice that we've stacked these 3d images into a 4d image.

da.stack(lazy_arrays[:10])

        Array                 Chunk
Bytes   3.16 GB               316.15 MB
Shape   (10, 201, 1024, 768)  (1, 201, 1024, 768)
Count   30 Tasks              10 Chunks
Type    uint16                numpy.ndarray

These are the common case situations, where you have a single axis along which you want to stitch images together.

Full example

This works fine for combining along a single axis. However, if we need to combine across multiple axes we need to perform multiple concatenate steps. Fortunately there is a simpler option, da.block, which can concatenate along multiple axes at once if you give it a nested list of Dask arrays.

a = da.block([[lazy_array_00, lazy_array_01],
              [lazy_array_10, lazy_array_11]])

We now do the following:

  • Parse each filename to learn where it should live in the larger array
  • See how many files are in each of our relevant dimensions
  • Allocate a NumPy object-dtype array of the appropriate size, where each element of this array will hold a single-chunk Dask array
  • Go through our filenames and insert the proper Dask array into the right position
  • Call da.block on the result

This code is a bit complex, but it shows what this looks like in a real-world setting:

import os
import numpy as np

# Get various dimensions

fn_comp_sets = dict()
for fn in filenames:
    for i, comp in enumerate(os.path.splitext(fn)[0].split("_")):
        fn_comp_sets.setdefault(i, set())
        fn_comp_sets[i].add(comp)
fn_comp_sets = list(map(sorted, fn_comp_sets.values()))

remap_comps = [
    dict(map(reversed, enumerate(fn_comp_sets[2]))),
    dict(map(reversed, enumerate(fn_comp_sets[4])))
]

# Create an empty object array to organize each chunk that loads a TIFF
a = np.empty(tuple(map(len, remap_comps)) + (1, 1, 1), dtype=object)

for fn, x in zip(filenames, lazy_arrays):
    channel = int(fn[fn.index("_ch") + 3:].split("_")[0])
    stack = int(fn[fn.index("_stack") + 6:].split("_")[0])

    a[channel, stack, 0, 0, 0] = x

# Stitch together the many blocks into a single array
a = da.block(a.tolist())

        Array                      Chunk
Bytes   188.74 GB                  316.15 MB
Shape   (3, 199, 201, 1024, 768)   (1, 1, 201, 1024, 768)
Count   2985 Tasks                 597 Chunks
Type    uint16                     numpy.ndarray

That's a logical array of nearly 190 GB, composed of around 600 chunks, each of size about 300 MB. We can now do normal NumPy-style computations on this array using Dask Array, but we'll save that for a future post.

>>> # array computations would work fine, and would run in low memory
>>> # but we'll save actual computation for future posts
>>> a.sum().compute()

Save Data

To simplify data loading in the future, we store this in a large chunked array format like Zarr using the to_zarr method.

a.to_zarr("mydata.zarr")

We may add additional information about the image data as attributes. This both makes things simpler for future users (they can read the full dataset with a single line using da.from_zarr) and much more performant, because Zarr is an analysis-ready format that is efficiently encoded for computation.
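For example, a later session might attach some metadata and then reload the whole dataset lazily (the attribute name and value here are only illustrative):

import zarr
import dask.array as da

# Attach some descriptive metadata to the stored array
z = zarr.open("mydata.zarr", mode="a")
z.attrs["axis_order"] = "channel, stack, z, y, x"

# Read the full dataset back as a lazy Dask array in one line
a = da.from_zarr("mydata.zarr")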

Zarr uses the Blosc library for compression by default. For scientific imaging data, we can optionally pass compression options that provide a good balance between compression ratio and speed, and optimize compression performance.

from numcodecs import Blosc
a.to_zarr("mydata.zarr", compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE))

Future Work

The workload above is generic and straightforward. It works well in simple cases and also extends well to more complex cases, provided you're willing to write some for-loops and parsing code around your custom logic. It works on a single small-scale laptop as well as a large HPC or Cloud cluster. If you have a function that turns a filename into a NumPy array, you can generate a large lazy Dask array using that function, Dask Delayed, and Dask Array.
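As a rough sketch of that last point (read_one_file stands in for whatever custom loading logic you have, and the shape and dtype would come from your own sample file):

import dask
import dask.array as da

def read_one_file(fn):
    # Any function that returns a NumPy array will do here
    return imageio.imread(fn)

lazy = [dask.delayed(read_one_file)(fn) for fn in filenames]
arrays = [da.from_delayed(x, shape=sample.shape, dtype=sample.dtype)
          for x in lazy]
stacked = da.stack(arrays)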

Dask Image

However, we can make things a bit easier for users if we specialize a bit. For example, the Dask Image library has a parallel image reader function, which automates much of our work above in the simple case.

>>> import dask_image.imread
>>> x = dask_image.imread.imread('raw/*.tif')

Similarly, libraries like Xarray have readers for other file formats, like GeoTIFF.
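For instance, a lazy, chunked read of a GeoTIFF with Xarray might look roughly like the following (the filename and chunk sizes are placeholders, and this relies on the rasterio backend being installed):

import xarray as xr

# Open a GeoTIFF lazily as a Dask-backed DataArray
data = xr.open_rasterio("example.tif", chunks={"band": 1, "x": 2048, "y": 2048})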

As domains do more and more work like what we did above, they tend to write down common patterns into domain-specific libraries, which then increases the accessibility and user base of these tools.

GPUs

If we have special hardware lying around like a few GPUs, we can move the data over to it and perform computations with a library like CuPy, which mimics NumPy very closely. We then benefit from the same operations listed above, but with the added performance of GPUs behind them.

import cupy as cp
a_gpu = a.map_blocks(cp.asarray)
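When results are needed back on the host, the chunks can be converted back the same way (a small sketch, assuming CuPy and a CUDA-capable GPU are available):

import cupy as cp

# Move the chunks back from GPU to host memory
a_cpu = a_gpu.map_blocks(cp.asnumpy)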

Computation

Finally, in future blogposts we plan to talk about how to compute on our large Dask arrays using common image-processing workloads like overlapping stencil functions, segmentation and deconvolution, and integrating with other libraries like ITK.