Description
It is much more expensive to open a versioned Dataset than an unversioned Dataset:
>>> # THIS CELL WAS AUTO-GENERATED BY PYFLYBY
>>> import h5py
>>> import numpy as np
>>> from versioned_hdf5 import VersionedHDF5File
>>> # END AUTO-GENERATED BLOCK
>>> import tempfile
>>> import timeit
>>> # reading single row from unversioned file
>>> with tempfile.TemporaryDirectory() as d:
...     with h5py.File(f'{d}/data.h5', 'w') as f:
...         data = f.create_group('data')
...         data.create_dataset('values', data=np.random.rand(365, 12345), chunks=(10, 100))
...     def read():
...         with h5py.File(f'{d}/data.h5', 'r') as f:
...             _ = f['data/values'][0]
...     print(timeit.timeit(read, number=100))
0.07837390998611227
>>> # reading from versioned file is much slower
>>> with tempfile.TemporaryDirectory() as d:
...     with h5py.File(f'{d}/data.h5', 'w') as f:
...         vf = VersionedHDF5File(f)
...         with vf.stage_version('r0') as sv:
...             data = sv.create_group('data')
...             data.create_dataset('values', data=np.random.rand(365, 12345), chunks=(10, 100))
...     def read():
...         with h5py.File(f'{d}/data.h5', 'r') as f:
...             vf = VersionedHDF5File(f)
...             cv = vf[vf.current_version]
...             _ = cv['data/values'][0]
...     print(timeit.timeit(read, number=100))
6.281358711014036
Looking at the profile results, we can see that the majority of the time is spent in HDF5 itself, reading the metadata for the virtual Dataset. This appears to be the same issue as observed in https://github.com/Quansight/deshaw/issues/496, where h5repack was very slow for the same reason: the profiles show time being spent in very similar functions.
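For reference, a profile like the one described above can be gathered with cProfile. The sketch below profiles a stand-in `read` function (plain file I/O) because the real reproduction requires h5py and versioned_hdf5; the stand-in only mirrors the shape of the benchmark above.

```python
import cProfile
import io
import os
import pstats
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'data.bin')
    with open(path, 'wb') as f:
        f.write(os.urandom(10_000))

    # Stand-in for the versioned read above; the real reproduction would
    # open the HDF5 file and index the virtual dataset instead.
    def read():
        with open(path, 'rb') as f:
            _ = f.read(100)

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(100):
        read()
    profiler.disable()

    # Sort by cumulative time to see where the time goes.
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
    print(stream.getvalue())
```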
We suspect (without much proof) that the most expensive part is reading the identical virtual_filename and virtual_dsetname over and over for each chunk and copying each into a newly allocated string.
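As a pure-Python analogy for this suspicion (the actual cost is inside the HDF5 C library, not Python): every chunk's metadata carries the same filename, and naive decoding allocates a fresh string per chunk, while a shared-string cache collapses them all into one allocation.

```python
import sys

raw = b'data.h5'

# Naive: one new str object per chunk, even though they are all equal.
naive = [raw.decode() for _ in range(10_000)]

# Deduplicated: sys.intern returns the same object for equal strings.
interned = [sys.intern(raw.decode()) for _ in range(10_000)]

print(len({id(s) for s in naive}))     # many distinct allocations
print(len({id(s) for s in interned}))  # 1
```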
Some ideas:
- Easy-ish? Detect whether the filename and dsetname are identical to the previously read ones and reuse the previously allocated string (shared strings?). This would avoid one allocation and one copy per chunk.
- Probably hard: Lazify the loading of the virtual dataset metadata until a chunk is actually read. Currently we read the chunk mapping for all chunks in the Dataset up front. Instead, we could read (and cache) that information only when a chunk is actually accessed.
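The second idea can be sketched at the Python level (all names here are hypothetical; the real fix would live in the layer that reads the virtual dataset's chunk mapping): resolve and cache a chunk's mapping on first access instead of reading every chunk's metadata when the dataset is opened.

```python
class LazyChunkMap:
    """Resolve the (filename, dsetname, index) mapping for a chunk on
    first access and cache it, rather than reading every chunk's
    metadata when the dataset is opened."""

    def __init__(self, resolve_chunk, n_chunks):
        # resolve_chunk(i) performs the (expensive) metadata read for chunk i.
        self._resolve = resolve_chunk
        self._n_chunks = n_chunks
        self._cache = {}

    def __getitem__(self, i):
        if i not in self._cache:
            self._cache[i] = self._resolve(i)
        return self._cache[i]

# Hypothetical resolver standing in for the HDF5 metadata read.
reads = []
def resolve(i):
    reads.append(i)
    return ('data.h5', 'data/values', i)

cmap = LazyChunkMap(resolve, n_chunks=1000)
cmap[0]       # only chunk 0's metadata is read
cmap[0]       # cached: no second read
print(reads)  # [0]
```

Reading a single row would then touch only the handful of chunks that intersect it, rather than all chunks in the Dataset.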
@crusaderky: this is the issue I had talked about in our meeting today.
@peytondmurray: can we put this request at the top of the queue? This currently blocks rolling out versioned-hdf5 to more groups within the firm.
