Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finding indexes with np.where(condition) or np.asarray(condition).nonzero() #30400

Open
StefanieSenger opened this issue Dec 3, 2024 · 4 comments

Comments

@StefanieSenger
Copy link
Contributor

StefanieSenger commented Dec 3, 2024

Throughout the repo, we use np.where(condition) for getting indexes, for instance in SelectorMixin.get_support(), in SimpleImputer.transform() and in several of our examples (example).

The numpy documentation discourages the use of np.where with just passing a condition and recommends np.asarray(condition).nonzero() instead.

For cleanliness of code, should we adopt this recommendation, at least in the examples? Or are there good reasons why we do that?

@StefanieSenger StefanieSenger added Documentation Needs Triage Issue requires triage labels Dec 3, 2024
@StefanieSenger StefanieSenger changed the title Finding indexes with np.where(cond) or np.asarray(condition).nonzero() Finding indexes with np.where(condition) or np.asarray(condition).nonzero() Dec 3, 2024
@ogrisel
Copy link
Member

ogrisel commented Dec 3, 2024

The reason why np.where(condition) is discouraged is documented as follows:

Using nonzero directly should be preferred, as it behaves correctly for subclasses.

I don't recall any issue reported by users using scikit-learn with subclasses of NumPy arrays and having problems as a result. I am not even sure what "behaves correctly" means in that context: returning a NumPy array of integer indices or a subclass instance storing the same indices? For our use cases, I think returning a NumPy array (not subclassing) is perfectly fine, but any option might work.

Let's try:

>>> import numpy as np
>>> class my_array(np.ndarray):
...     pass
... 
>>> condition = np.asarray([True, False, False, True, True]).view(my_array)
>>> condition
my_array([ True, False, False,  True,  True])
>>> np.where(condition)[0]
array([0, 3, 4])
>>> condition.nonzero()[0]
array([0, 3, 4])
>>> np.nonzero(condition)[0]
array([0, 3, 4])

So I don't see any difference: neither returns an instance of the my_array subclass.

I traced this recommendation in the doc to this PR by @eric-wieser: numpy/numpy#11425

However, whenever we convert a scikit-learn estimator to add array API support, the xp.where prototype in the spec implies that the second and third argument are not optional (contrary to calling np.where with a single condition argument):

https://data-apis.org/array-api/latest/API_specification/generated/array_api.where.html#where

Since nonzero is also part of the array API, I think it would make sense to use xp.nonzero instead when we convert code to add array API support:

https://data-apis.org/array-api/latest/API_specification/generated/array_api.nonzero.html#nonzero

Apart from adding array API support, I would rather not change the code unless we identify a good reason to do it (in particular if we identify a user-facing impact). Personally, I am not convinced that one is particularly more readable than the other.

That being said, I would be curious to understand better why np.where(condition) is discouraged in the first place: it's not clear from the NumPy doc and my quick experiment above failed to clarify the issue.

@ogrisel ogrisel removed the Needs Triage Issue requires triage label Dec 3, 2024
@lesteve
Copy link
Member

lesteve commented Dec 5, 2024

Maybe the note about the issue with subclasses is actually a reference to using np.where with a masked array: numpy/numpy#11425 (comment).

@lorentzenchr
Copy link
Member

@eric-wieser @seberg friendly ping. We would like to better understand the numpy recommendation of using np.asarray(x).nonzero() over np.where(x).

Personally, I would follow numpy‘s recommendation without questioning it too much: their realm, their expertise.

@seberg
Copy link
Contributor

seberg commented Dec 10, 2024

where skips the .nonzero() attribute, so if a subclass defines that it is different (maybe only then). Of course these days the subclass could define __array_wrap__ so it is less of a thing maybe.
I never loved the overload of where to have two meanings, although I bet "where" sounds nicer to many than nonzero.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants