Writing virtual datasets seems to be quite slow because of the calls to `deepcopy` in `VirtualSource.__getitem__`. Appending 100 chunks with plain h5py:
```
In [26]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         a = np.random.rand(1, 36, 26, 19)
    ...:         f.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             f['bar'].resize((i + 1, 36, 26, 19))
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             f['bar'][i, :, :, :] = a
    ...:
CPU times: user 129 ms, sys: 8.01 ms, total: 137 ms
Wall time: 137 ms
```
The same appends through VersionedHDF5File, one staged version per chunk:

```
In [27]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         with vf.stage_version('v0') as sv:
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             sv.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             vf = VersionedHDF5File(f)
    ...:             with vf.stage_version('v{i}'.format(i=i)) as sv:
    ...:                 sv['bar'].resize((i + 1, 36, 26, 19))
    ...:                 a = np.random.rand(1, 36, 26, 19)
    ...:                 sv['bar'][[i], ...] = a
    ...:
CPU times: user 2.65 s, sys: 49.3 ms, total: 2.7 s
Wall time: 2.7 s
```
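
The versioned writes take roughly 20x as long as the plain h5py ones (2.7 s vs. 137 ms). To check that the time really goes into `deepcopy`, one can profile the staged-write loop. This is a minimal sketch, not part of the report above: it assumes the `versioned_hdf5` package import, uses only the standard-library profiler, and substitutes a temporary directory for `TempDirCtx`.

```python
import cProfile
import pstats
import tempfile
from pathlib import Path

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File


def staged_writes(path, n=20):
    # Create the initial version with a single chunk.
    with h5py.File(path, 'w') as f:
        vf = VersionedHDF5File(f)
        with vf.stage_version('v0') as sv:
            a = np.random.rand(1, 36, 26, 19)
            sv.create_dataset('bar', data=a, chunks=a.shape,
                              maxshape=(None, None, None, None))
    # Append one chunk per staged version, as in the timing above.
    for i in range(1, n):
        with h5py.File(path, 'r+') as f:
            vf = VersionedHDF5File(f)
            with vf.stage_version('v{}'.format(i)) as sv:
                sv['bar'].resize((i + 1, 36, 26, 19))
                sv['bar'][[i], ...] = np.random.rand(1, 36, 26, 19)


with tempfile.TemporaryDirectory() as d:
    prof = cProfile.Profile()
    prof.runcall(staged_writes, Path(d) / 'foo.h5')
    # Sort by cumulative time and show only deepcopy-related entries;
    # if the diagnosis is right, they should account for much of the total.
    pstats.Stats(prof).sort_stats('cumulative').print_stats('deepcopy')
```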
Looking at the code, it seems there was a performance optimization there that was broken by h5py 3.3:
h5py/h5py#1905
Is it possible to work around this performance regression?
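
Until the upstream regression is fixed, one workaround might be to batch several appends into a single staged version: the virtual dataset, and with it the `VirtualSource.__getitem__` calls, is then rebuilt once per batch instead of once per row. A rough sketch under that assumption; `BATCH` and `append_batched` are illustrative names, not part of the library:

```python
import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

BATCH = 10  # illustrative batch size; tune for the workload


def append_batched(path, n_rows):
    """Append rows [1, n_rows) to 'bar', one staged version per batch."""
    for start in range(1, n_rows, BATCH):
        stop = min(start + BATCH, n_rows)
        with h5py.File(path, 'r+') as f:
            vf = VersionedHDF5File(f)
            # One stage_version per batch: the virtual dataset is rebuilt
            # about n_rows / BATCH times rather than n_rows times.
            with vf.stage_version('v{}'.format(start)) as sv:
                sv['bar'].resize((stop, 36, 26, 19))
                sv['bar'][start:stop, ...] = np.random.rand(stop - start, 36, 26, 19)
```

Whether that is acceptable depends on whether per-row version granularity is actually needed, so it is more of a mitigation than a fix.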