
Slow writing with many small chunks #167

Open
@ArvidJB

Description


Unfortunately, in our use case we often end up with suboptimal chunk sizes. Plain, unversioned h5py handles those without issue, but with versioned_hdf5 writes turn out to be pretty slow:

import time

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

dt = np.dtype('double')
d0 = 2
d1 = 15220
d2 = 2
chunks = (600, 2, 4)
with h5py.File('foo.h5', 'w') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version('0') as sv:
        sv.create_dataset('bar', shape=(d0, d1, d2), maxshape=(None, None, None),
                          chunks=chunks, dtype=dt,
                          data=np.full((d0, d1, d2), 0, dtype=dt))

i = 1  # new version label; also used as the fill value below
start = time.time()
with h5py.File('foo.h5', 'r+') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version(str(i)) as sv:
        # update 30 random rows along axis 1, i.e. 2 * 30 * 2 = 120 numbers
        i2 = np.random.choice(d1, 30, replace=False)
        i2 = np.sort(i2)
        sv['bar'][:, i2, :] = np.full((d0, len(i2), d2), i, dtype=dt)
end = time.time()
print('writing: {}'.format(end - start))

This takes around 9 seconds for me to write 120 numbers.

A little bit of profiling points to two things:

  1. The call to as_subchunks in InMemoryDataset.__setitem__
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   593                                               @with_phil
   594                                               @profile
   595                                               def __setitem__(self, args, val):
...
   700        78   24219378.0 310504.8     99.0          for c in self.chunks.as_subchunks(idx, self.shape):
...

where it ends up calling _fallback because there is no case for IntegerArray. Could we not use the same code path as for Integer?
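For illustration, here is a rough sketch of the idea (not the actual ndindex code; the helper name is made up): an integer-array index along one axis can be mapped directly to just the chunks it touches, the same way a single Integer is, instead of going through the generic fallback.

import numpy as np

def chunks_for_integer_array(indices, chunk_size):
    """Group a 1-D integer selection by the chunk each index falls into.

    Yields (chunk_number, offsets_within_chunk) pairs, so only chunks that
    actually contain selected indices need to be visited.
    """
    idx = np.sort(np.asarray(indices))
    chunk_numbers = idx // chunk_size
    for c in np.unique(chunk_numbers):
        yield int(c), idx[chunk_numbers == c] - c * chunk_size

# In the example above the chunk extent along axis 1 is 2, so there are
# 15220 / 2 = 7610 chunks along that axis, but 30 selected indices touch
# at most 30 of them.
i2 = np.sort(np.random.choice(15220, 30, replace=False))
for chunk_no, offsets in chunks_for_integer_array(i2, 2):
    print(chunk_no, offsets)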

  2. The other slow spot is this loop in create_virtual_dataset:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   170                                           @profile
   171                                           def create_virtual_dataset(f, version_name, name, shape, slices, attrs=None, fillvalue=None):
....
   192     26127      50638.0      1.9      0.2          for c, s in slices.items():
   193     26124    1592688.0     61.0      6.6              if c.isempty():
   194                                                           continue
   195                                                       # idx = Tuple(s, *Tuple(*[slice(0, i) for i in shape[1:]]).as_subindex(Tuple(*c.args[1:])).args)
   196     26124    5472288.0    209.5     22.8              S = [Slice(0, shape[i], 1).as_subindex(c.args[i]) for i in range(1, len(shape))]
   197     26123    1725495.0     66.1      7.2              idx = Tuple(s, *S)
   198                                                       # assert c.newshape(shape) == vs[idx.raw].shape, (c, shape, s)
   199     26123   12892876.0    493.5     53.8              layout[c.raw] = vs[idx.raw]
...

Is it possible to speed this up? In this example we only change a very small subset of the data. If we could keep track of the changes, could we not copy the old virtual dataset and modify only the affected entries?
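As a rough illustration of that idea (the helper names below are made up, not part of versioned_hdf5): the subindex computed on profile line 196 depends only on the chunk c and the dataset shape, not on the version being written, so it could be memoized; and since the chunk Tuples are already used as dict keys, diffing the new slices dict against the previous version's would tell us which few mappings actually need to be recomputed.

from functools import lru_cache
from ndindex import Slice

@lru_cache(maxsize=None)
def chunk_subindex(chunk_args, shape):
    """Memoize the per-axis subindex from profile line 196; it depends only
    on the chunk (c.args) and the dataset shape, not on the version."""
    return tuple(Slice(0, shape[i], 1).as_subindex(chunk_args[i])
                 for i in range(1, len(shape)))

def changed_chunks(prev_slices, new_slices):
    """Yield only the chunks whose mapping into the raw data changed between
    two versions; unchanged chunks could keep their entry from the previous
    version's virtual dataset instead of being re-mapped from scratch."""
    for c, s in new_slices.items():
        if prev_slices.get(c) != s:
            yield c, s

Whether h5py's VirtualLayout allows the unchanged mappings to be carried over without re-assigning them is a separate question, but at least the index arithmetic would not need to be redone for every chunk on every commit.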
