When loading variables from a CDF file using pycdf, a significant performance degradation is encountered if the user calls:
from spacepy import pycdf
import numpy as np
fid = pycdf.CDF('data/file/path.cdf')
data = np.array(fid['varname']) # or np.asarray(...)
compared to
from spacepy import pycdf
import numpy as np
fid = pycdf.CDF('data/file/path.cdf')
data = fid['varname'][...]
These two snippets produce the same data arrays, but the former takes roughly 4x as long (measured on my Mac when loading a FEDU variable from a MAGEIS L3 data file; the ratio obviously depends strongly on the size of the variable being loaded). I realise that this is not necessarily the intended use-case of the pycdf.Var class, but in theory, casting the Var to an array should be no slower than extracting the data manually as in the second snippet. I am also of the slightly opinionated view that the former is more pythonic, but that's not really the point.
The performance problem arises because pycdf.Var does not define an __array__ method, which numpy would call when converting a Var to an ndarray. Instead, numpy iterates over the Var (which is allowed, since Var is defined as a Sequence), and this loads the data from the file one value at a time. For a large data array, this results in a huge number of IO calls, significantly slowing things down.
I believe that adding a simple __array__ method to Var would solve this issue.
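As a rough illustration of the idea (not the actual pycdf code), the sketch below uses two hypothetical stand-in classes, `SlowVar` and `FastVar`, which count every `__getitem__` call as one "file read". Both names, the `reads` counter, and the in-memory `_data` array are assumptions made for the demo. The `__array__` signature follows what recent numpy versions expect.

```python
from collections.abc import Sequence

import numpy as np


class SlowVar(Sequence):
    """Hypothetical stand-in for pycdf.Var; each __getitem__ call counts as one read."""

    def __init__(self, data):
        self._data = np.asarray(data)
        self.reads = 0

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        self.reads += 1  # pretend each access hits the file
        return self._data[key]


class FastVar(SlowVar):
    """Same stand-in, plus an __array__ hook that does a single bulk read."""

    def __array__(self, dtype=None, copy=None):
        data = self[...]  # one read for the whole variable
        if dtype is not None:
            data = data.astype(dtype)
        return data


values = np.arange(1000.0)
slow, fast = SlowVar(values), FastVar(values)

slow_arr = np.asarray(slow)  # falls back to element-by-element iteration
fast_arr = np.asarray(fast)  # goes through __array__

print("reads without __array__:", slow.reads)
print("reads with __array__:   ", fast.reads)
```

In this toy setup the version without `__array__` performs at least one read per element, while the version with it performs a single read. Whether the real Var can delegate to `self[...]` this simply is exactly the open question.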
I haven't considered side effects or other ramifications of this, though. For instance, I suspect the interaction with NRV variables may be slightly more complicated. I also realise that, strictly speaking, this method is not necessary, as calling fid['varname'][...] already produces a numpy array. In my opinion, though, having np.array (or np.asarray) result in significantly higher IO and poorer performance violates the principle of least surprise, particularly when the fix is so simple (knock on wood).
The other suggestion, if this is not wanted, would be to add an __array__ method to Var that simply raises NotImplementedError, to indicate that this is not the intended way to interact with Var and to keep users from accidentally incurring far more overhead than necessary.
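A minimal sketch of that alternative, again using a hypothetical stand-in class (`GuardedVar` is not part of pycdf, and the error message is only an example). It assumes numpy propagates an exception raised inside `__array__` rather than falling back to iteration.

```python
from collections.abc import Sequence

import numpy as np


class GuardedVar(Sequence):
    """Hypothetical stand-in for pycdf.Var that refuses implicit conversion."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        return self._data[key]

    def __array__(self, dtype=None, copy=None):
        # Fail loudly instead of letting numpy iterate element by element.
        raise NotImplementedError(
            "Convert explicitly with var[...] instead of np.array(var)")


var = GuardedVar([1.0, 2.0, 3.0])

try:
    np.asarray(var)
    conversion_blocked = False
except NotImplementedError:
    conversion_blocked = True

explicit = var[...]  # the explicit route still returns an ndarray
print(conversion_blocked, explicit)
```

The trade-off is that existing code relying on `np.array(var)` would start failing rather than merely running slowly.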
Sorry, I think I worded that wrong. I meant in comparison to converting the variable the "proper" way. Poor choice of wording on my part, I blame jetlag. Performance degradation might be a better term.
aaronhendry changed the title from "Performance regression when using np.array or np.asarray on pycdf.Var type" to "Performance issues when using np.array or np.asarray on pycdf.Var type" on Dec 10, 2024