Performance issues when using np.array or np.asarray on pycdf.Var type #800

aaronhendry · 2024-12-10T19:40:25Z

When loading variables from a CDF file using pycdf, a significant performance degradation is encountered if the user calls:

from spacepy import pycdf
import numpy as np

fid = pycdf.CDF('data/file/path.cdf')
data = np.array(fid['varname'])  # or np.asarray(...)

compared to

from spacepy import pycdf
import numpy as np

fid = pycdf.CDF('data/file/path.cdf')
data = fid['varname'][...]

These two snippets produce the same data matrices, but the former takes roughly 4x as long (on my Mac when loading a FEDU variable from a MAGEIS L3 data file, it obviously depends strongly on the size of the variable being loaded). I realise that this is not necessarily the intended use-case of the pycdf.Var class, but in theory, casting the Var to an array should be no slower than extracting the data manually as in the second instance. I am also of the slightly opinionated view that the former is more pythonic, but that's not really the point. The reason for the performance issues is because pycdf.Var does not define a __array__ function, which would be called by numpy when trying to convert Var to an ndarray. Instead, numpy iterates over the Var (which is allowed, since Var is defined as a Sequence), which loads the data from the file one value at a time. This results in a huge number of IO calls (in the case of a large data array), significantly slowing things down.

I believe that adding a simple __array__ function would solve this issue:

class Var(MutableSequence, spacepy.datamodel.MetaMixin):
    ...
    def __array__(self, dtype=None, copy=None):
        return self[...]
    ...

I haven't considered side-effects or other ramifications of this though. For instance, I suspect the interaction with NRV variables may be slightly more complicated. I also realise that strictly speaking this is not a necessary function, as calling fid['varname'][...] already produces a numpy array, but in my opinion having the np.array (or np.asarray) case result in siginificantly higher IO and poorer performance violates the principle of least surprise, particularly when the fix is so simple (knock on wood).

The other suggestion, if this is not wanted, would be to add a array method to Var that simply throws a NotImplemented error, to indicate that this is not the intended manner to interact with Var, and avoid users accidentally adding significantly higher overhead than necessary.

The text was updated successfully, but these errors were encountered:

jtniehof · 2024-12-10T20:05:33Z

Hi Aaron--

Thanks for pointing this out, and I'll look into implementing __array__. This connects with some other considerations regarding datamodel (#719).

I do have one question: you note this as a regression. What version is this a regression relative to--when was it higher performance?

aaronhendry · 2024-12-10T21:04:27Z

Sorry, I think I worded that wrong. I meant in comparison to converting the variable the "proper" way. Poor choice of wording on my part, I blame jetlag. Performance degradation might be a better term.

aaronhendry changed the title ~~Performance regression when using np.array or np.asarray on pycdf.Var type~~ Performance issues when using np.array or np.asarray on pycdf.Var type Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issues when using np.array or np.asarray on pycdf.Var type #800

Performance issues when using np.array or np.asarray on pycdf.Var type #800

aaronhendry commented Dec 10, 2024 •

edited

Loading

jtniehof commented Dec 10, 2024

aaronhendry commented Dec 10, 2024

Performance issues when using np.array or np.asarray on pycdf.Var type #800

Performance issues when using np.array or np.asarray on pycdf.Var type #800

Comments

aaronhendry commented Dec 10, 2024 • edited Loading

jtniehof commented Dec 10, 2024

aaronhendry commented Dec 10, 2024

aaronhendry commented Dec 10, 2024 •

edited

Loading