Unfortunately, in our use case we often end up with suboptimal chunk sizes. Unversioned h5py handles those without issues, but with versioned_hdf5 the same writes turn out to be pretty slow:
```python
import time

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

dt = np.dtype('double')
d0 = 2
d1 = 15220
d2 = 2
chunks = (600, 2, 4)

# Create the initial version.
with h5py.File('foo.h5', 'w') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version('0') as sv:
        sv.create_dataset('bar', shape=(d0, d1, d2), maxshape=(None, None, None),
                          chunks=chunks, dtype=dt,
                          data=np.full((d0, d1, d2), 0, dtype=dt))

i = 1  # version counter; in our real code this write happens in a loop
start = time.time()
with h5py.File('foo.h5', 'r+') as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version(str(i)) as sv:
        # Overwrite 30 random rows along the second axis.
        i2 = np.random.choice(d1, 30, replace=False)
        i2 = np.sort(i2)
        sv['bar'][:, i2, :] = np.full((d0, len(i2), d2), i, dtype=dt)
end = time.time()
print('writing: {}'.format(end - start))
```
This takes around 9 seconds for me to write 120 numbers (2 × 30 × 2 elements).
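For comparison, here is a minimal sketch of the same write pattern against plain, unversioned h5py (the file name `baseline.h5` is just for illustration), which does not show the slowdown:

```python
import time

import h5py
import numpy as np

dt = np.dtype('double')
d0, d1, d2 = 2, 15220, 2

with h5py.File('baseline.h5', 'w') as f:
    f.create_dataset('bar', shape=(d0, d1, d2), maxshape=(None, None, None),
                     chunks=(600, 2, 4), dtype=dt,
                     data=np.full((d0, d1, d2), 0, dtype=dt))

start = time.time()
with h5py.File('baseline.h5', 'r+') as f:
    # h5py fancy indexing requires the index array to be sorted.
    i2 = np.sort(np.random.choice(d1, 30, replace=False))
    f['bar'][:, i2, :] = np.full((d0, 30, d2), 1, dtype=dt)
print('writing: {}'.format(time.time() - start))
```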
A little bit of profiling points to two things:
- The call to `as_subchunks` in `InMemoryDataset.__setitem__`:

  ```
  Line #      Hits         Time  Per Hit   % Time  Line Contents
  ==============================================================
     593                                           @with_phil
     594                                           @profile
     595                                           def __setitem__(self, args, val):
     ...
     700        78   24219378.0 310504.8     99.0      for c in self.chunks.as_subchunks(idx, self.shape):
     ...
  ```

  `as_subchunks` ends up calling `_fallback` because there is no case for `IntegerArray`. Could we not use the same code path as for `Integer`? (A rough sketch of that idea follows this list.)
- The other slow spot is this loop in `create_virtual_dataset`:

  ```
  Line #      Hits         Time  Per Hit   % Time  Line Contents
  ==============================================================
     170                                           @profile
     171                                           def create_virtual_dataset(f, version_name, name, shape, slices, attrs=None, fillvalue=None):
     ...
     192     26127      50638.0      1.9      0.2      for c, s in slices.items():
     193     26124    1592688.0     61.0      6.6          if c.isempty():
     194                                                       continue
     195                                                   # idx = Tuple(s, *Tuple(*[slice(0, i) for i in shape[1:]]).as_subindex(Tuple(*c.args[1:])).args)
     196     26124    5472288.0    209.5     22.8          S = [Slice(0, shape[i], 1).as_subindex(c.args[i]) for i in range(1, len(shape))]
     197     26123    1725495.0     66.1      7.2          idx = Tuple(s, *S)
     198                                                   # assert c.newshape(shape) == vs[idx.raw].shape, (c, shape, s)
     199     26123   12892876.0    493.5     53.8          layout[c.raw] = vs[idx.raw]
     ...
  ```
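For the first point, here is a rough sketch of the idea (not ndindex's actual API; `split_by_chunk` is a hypothetical helper): with a sorted index array, the indices falling into each chunk can be computed directly with NumPy, analogously to the `Integer` case, instead of going through the generic `_fallback`:

```python
import numpy as np

def split_by_chunk(idx, chunk_size):
    """Group a sorted, non-empty integer index array by chunk.

    Returns {chunk_number: indices relative to that chunk's start},
    so each touched chunk can be visited exactly once.
    """
    idx = np.asarray(idx)
    chunk_nums = idx // chunk_size
    # The input is sorted, so equal chunk numbers are contiguous.
    boundaries = np.flatnonzero(np.diff(chunk_nums)) + 1
    return {int(g[0] // chunk_size): g % chunk_size
            for g in np.split(idx, boundaries)}

# Indices 1, 4, 5 and 10 into an axis with chunk size 2:
print(split_by_chunk(np.array([1, 4, 5, 10]), 2))
# -> {0: array([1]), 2: array([0, 1]), 5: array([0])}
```

Since the write indices in our use case are sorted anyway, only the chunks actually touched would need to be computed.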
Is it possible to speed up the `create_virtual_dataset` loop? In this example we only change a very small subset of the data. If we kept track of the changes, we could probably copy the old virtual dataset's mapping and modify only the affected entries.
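To illustrate, here is a minimal sketch assuming we kept the previous version's chunk-to-raw-slice mapping around (the names `old_mapping` and `changed_chunks` are hypothetical, not versioned_hdf5 API):

```python
def update_virtual_mapping(old_mapping, changed_chunks):
    """Incrementally update the chunk -> raw-dataset-slice mapping.

    old_mapping:    {target_chunk: raw_slice} from the previous version
    changed_chunks: {target_chunk: raw_slice} for the chunks actually
                    rewritten in this version (a small subset for us)
    """
    # Start from a copy of the old mapping instead of recomputing the
    # index arithmetic for all ~26,000 chunks from scratch.
    new_mapping = dict(old_mapping)
    new_mapping.update(changed_chunks)
    return new_mapping
```

Even if the HDF5 virtual layout itself still has to be rebuilt for the new version, caching the computed `idx` per chunk would at least avoid the `as_subindex` calls for the unchanged chunks, which account for roughly a quarter of the time in the profile above.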