create_virtual_dataset is slow #226

@ArvidJB

Description

Writing virtual datasets seems to be pretty slow because of the calls to deepcopy in VirtualSource.__getitem__:
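For context, here is a minimal sketch of the pattern that hits this code path (the file and dataset names are placeholders, not versioned-hdf5 internals). Since h5py 3.3, every slice of a VirtualSource goes through VirtualSource.__getitem__, which deepcopies the whole source object, so building a layout chunk by chunk pays that cost once per chunk:

import h5py

nchunks, chunk = 1000, (36, 26, 19)
layout = h5py.VirtualLayout(shape=(nchunks,) + chunk, dtype='f8')
vs = h5py.VirtualSource('raw.h5', 'raw_data', shape=(nchunks,) + chunk)
for i in range(nchunks):
    # VirtualSource.__getitem__ deepcopies vs before applying the
    # selection, so this loop performs nchunks deepcopies.
    layout[i] = vs[i]

(The timing sessions below assume import h5py, import numpy as np, from versioned_hdf5 import VersionedHDF5File, and a TempDirCtx helper that presumably creates and cleans up a temporary directory.)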

In [26]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         a = np.random.rand(1, 36, 26, 19)
    ...:         f.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             f['bar'].resize((i + 1, 36, 26, 19))
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             f['bar'][i, :, :, :] = a
    ...:
    ...:
CPU times: user 129 ms, sys: 8.01 ms, total: 137 ms
Wall time: 137 ms

In [27]: %%time
    ...: with TempDirCtx() as d:
    ...:     with h5py.File(d / 'foo.h5', 'w') as f:
    ...:         vf = VersionedHDF5File(f)
    ...:         with vf.stage_version('v0') as sv:
    ...:             a = np.random.rand(1, 36, 26, 19)
    ...:             sv.create_dataset('bar', data=a, chunks=a.shape, maxshape=(None, None, None, None))
    ...:     for i in range(1, 101):
    ...:         with h5py.File(d / 'foo.h5', 'r+') as f:
    ...:             vf = VersionedHDF5File(f)
    ...:             with vf.stage_version('v{i}'.format(i=i)) as sv:
    ...:                 sv['bar'].resize((i + 1, 36, 26, 19))
    ...:                 a = np.random.rand(1, 36, 26, 19)
    ...:                 sv['bar'][[i], ...] = a
    ...:
    ...:
    ...:
CPU times: user 2.65 s, sys: 49.3 ms, total: 2.7 s
Wall time: 2.7 s
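One way to confirm that the deepcopy calls dominate (a sketch, assuming the same setup as the session above; stage_one_version() is a hypothetical wrapper around one iteration of the stage_version loop):

import cProfile
import pstats

cProfile.run('stage_one_version()', 'stage.prof')
# Restrict the report to deepcopy and __getitem__ frames.
pstats.Stats('stage.prof').sort_stats('cumulative').print_stats('deepcopy|__getitem__')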

Looking at the code, it seems there was a performance optimization there that was broken by h5py 3.3:
h5py/h5py#1905
Is it possible to work around this performance regression?
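One possible workaround, sketched here only to show that the deepcopy lives in h5py's high-level convenience layer rather than in HDF5's VDS machinery (file and dataset names are placeholders, and this is not how versioned-hdf5 is currently implemented): build the mapping through the low-level API, where select_hyperslab mutates a dataspace in place and set_virtual copies the selection on the C side:

import h5py
import numpy as np

nchunks, chunk = 100, (36, 26, 19)
shape = (nchunks,) + chunk

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('raw_data', data=np.random.rand(*shape))

    dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    virt_space = h5py.h5s.create_simple(shape)
    src_space = h5py.h5s.create_simple(shape)
    for i in range(nchunks):
        # select_hyperslab replaces the selection on the existing SpaceID
        # in place; set_virtual copies it internally, so no per-chunk
        # Python-level deepcopy is involved.
        virt_space.select_hyperslab((i, 0, 0, 0), (1,) + chunk)
        src_space.select_hyperslab((i, 0, 0, 0), (1,) + chunk)
        dcpl.set_virtual(virt_space, b'data.h5', b'raw_data', src_space)
    h5py.h5d.create(f.id, b'virtual', h5py.h5t.NATIVE_DOUBLE,
                    h5py.h5s.create_simple(shape), dcpl=dcpl)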
