
Slow writing with many small chunks #167

Open
@ArvidJB

Description


Unfortunately, in our use case we often end up with suboptimal chunk sizes. Plain, unversioned h5py handles those without issue, but with versioned_hdf5 writes turn out to be pretty slow:

import time

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

dt = np.dtype('double')
d0 = 2
d1 = 15220
d2 = 2
chunks = (600, 2, 4)
with h5py.File('foo.h5', 'w') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version('0') as sv:
        sv.create_dataset('bar', shape=(d0, d1, d2), maxshape=(None, None, None),
                          chunks=chunks, dtype=dt,
                          data=np.full((d0, d1, d2), 0, dtype=dt))

i = 1  # new version label; also used as the fill value below
start = time.time()
with h5py.File('foo.h5', 'r+') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version(str(i)) as sv:
        # update 30 random rows along axis 1, i.e. 2 * 30 * 2 = 120 numbers
        i2 = np.random.choice(d1, 30, replace=False)
        i2 = np.sort(i2)
        sv['bar'][:, i2, :] = np.full((d0, len(i2), d2), i, dtype=dt)
end = time.time()
print('writing: {}'.format(end - start))

This takes around 9 seconds for me to write 120 numbers.

A little bit of profiling points to two things:

  1. The call to as_subchunks in InMemoryDataset.__setitem__
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   593                                               @with_phil
   594                                               @profile
   595                                               def __setitem__(self, args, val):
...
   700        78   24219378.0 310504.8     99.0          for c in self.chunks.as_subchunks(idx, self.shape):
...

where it ends up calling _fallback because there is no case for IntegerArray. Could we not use the same code path as for Integer?
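For illustration, here is a rough sketch of the idea (not the actual ndindex code; the helper name is made up): an integer-array index along one axis can be mapped directly to just the chunks it touches, the same way a single Integer is, instead of going through the generic fallback.

import numpy as np

def chunks_for_integer_array(indices, chunk_size):
    """Group a 1-D integer selection by the chunk each index falls into.

    Yields (chunk_number, offsets_within_chunk) pairs, so only chunks that
    actually contain selected indices need to be visited.
    """
    idx = np.sort(np.asarray(indices))
    chunk_numbers = idx // chunk_size
    for c in np.unique(chunk_numbers):
        yield int(c), idx[chunk_numbers == c] - c * chunk_size

# In the example above the chunk extent along axis 1 is 2, so there are
# 15220 / 2 = 7610 chunks along that axis, but 30 selected indices touch
# at most 30 of them.
i2 = np.sort(np.random.choice(15220, 30, replace=False))
for chunk_no, offsets in chunks_for_integer_array(i2, 2):
    print(chunk_no, offsets)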

  2. The other slow spot is this loop in create_virtual_dataset:
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   170                                           @profile
   171                                           def create_virtual_dataset(f, version_name, name, shape, slices, attrs=None, fillvalue=None):
....
   192     26127      50638.0      1.9      0.2          for c, s in slices.items():
   193     26124    1592688.0     61.0      6.6              if c.isempty():
   194                                                           continue
   195                                                       # idx = Tuple(s, *Tuple(*[slice(0, i) for i in shape[1:]]).as_subindex(Tuple(*c.args[1:])).args)
   196     26124    5472288.0    209.5     22.8              S = [Slice(0, shape[i], 1).as_subindex(c.args[i]) for i in range(1, len(shape))]
   197     26123    1725495.0     66.1      7.2              idx = Tuple(s, *S)
   198                                                       # assert c.newshape(shape) == vs[idx.raw].shape, (c, shape, s)
   199     26123   12892876.0    493.5     53.8              layout[c.raw] = vs[idx.raw]
...

Is it possible to speed this up? In this example we only change a very small subset of the data. If we could keep track of the changes, could we not copy the old virtual dataset and modify only the affected entries?
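As a rough illustration of that idea (the helper names below are made up, not part of versioned_hdf5): the subindex computed on profile line 196 depends only on the chunk c and the dataset shape, not on the version being written, so it could be memoized; and since the chunk Tuples are already used as dict keys, diffing the new slices dict against the previous version's would tell us which few mappings actually need to be recomputed.

from functools import lru_cache
from ndindex import Slice

@lru_cache(maxsize=None)
def chunk_subindex(chunk_args, shape):
    """Memoize the per-axis subindex from profile line 196; it depends only
    on the chunk (c.args) and the dataset shape, not on the version."""
    return tuple(Slice(0, shape[i], 1).as_subindex(chunk_args[i])
                 for i in range(1, len(shape)))

def changed_chunks(prev_slices, new_slices):
    """Yield only the chunks whose mapping into the raw data changed between
    two versions; unchanged chunks could keep their entry from the previous
    version's virtual dataset instead of being re-mapped from scratch."""
    for c, s in new_slices.items():
        if prev_slices.get(c) != s:
            yield c, s

Whether h5py's VirtualLayout allows the unchanged mappings to be carried over without re-assigning them is a separate question, but at least the index arithmetic would not need to be redone for every chunk on every commit.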
