Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues when using np.array or np.asarray on pycdf.Var type #800

Open
aaronhendry opened this issue Dec 10, 2024 · 2 comments
Open

Comments

@aaronhendry
Copy link
Contributor

aaronhendry commented Dec 10, 2024

When loading variables from a CDF file using pycdf, a significant performance degradation is encountered if the user calls:

from spacepy import pycdf
import numpy as np

fid = pycdf.CDF('data/file/path.cdf')
data = np.array(fid['varname'])  # or np.asarray(...)

compared to

from spacepy import pycdf
import numpy as np

fid = pycdf.CDF('data/file/path.cdf')
data = fid['varname'][...]

These two snippets produce the same data matrices, but the former takes roughly 4x as long (on my Mac when loading a FEDU variable from a MAGEIS L3 data file, it obviously depends strongly on the size of the variable being loaded). I realise that this is not necessarily the intended use-case of the pycdf.Var class, but in theory, casting the Var to an array should be no slower than extracting the data manually as in the second instance. I am also of the slightly opinionated view that the former is more pythonic, but that's not really the point. The reason for the performance issues is because pycdf.Var does not define a __array__ function, which would be called by numpy when trying to convert Var to an ndarray. Instead, numpy iterates over the Var (which is allowed, since Var is defined as a Sequence), which loads the data from the file one value at a time. This results in a huge number of IO calls (in the case of a large data array), significantly slowing things down.

I believe that adding a simple __array__ function would solve this issue:

class Var(MutableSequence, spacepy.datamodel.MetaMixin):
    ...
    def __array__(self, dtype=None, copy=None):
        return self[...]
    ...

I haven't considered side-effects or other ramifications of this though. For instance, I suspect the interaction with NRV variables may be slightly more complicated. I also realise that strictly speaking this is not a necessary function, as calling fid['varname'][...] already produces a numpy array, but in my opinion having the np.array (or np.asarray) case result in siginificantly higher IO and poorer performance violates the principle of least surprise, particularly when the fix is so simple (knock on wood).

The other suggestion, if this is not wanted, would be to add a array method to Var that simply throws a NotImplemented error, to indicate that this is not the intended manner to interact with Var, and avoid users accidentally adding significantly higher overhead than necessary.

@jtniehof
Copy link
Member

Hi Aaron--

Thanks for pointing this out, and I'll look into implementing __array__. This connects with some other considerations regarding datamodel (#719).

I do have one question: you note this as a regression. What version is this a regression relative to--when was it higher performance?

@aaronhendry
Copy link
Contributor Author

Sorry, I think I worded that wrong. I meant in comparison to converting the variable the "proper" way. Poor choice of wording on my part, I blame jetlag. Performance degradation might be a better term.

@aaronhendry aaronhendry changed the title Performance regression when using np.array or np.asarray on pycdf.Var type Performance issues when using np.array or np.asarray on pycdf.Var type Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants