Finding indexes with `np.where(condition)` or `np.asarray(condition).nonzero()` #30400

StefanieSenger · 2024-12-03T12:57:54Z

Throughout the repo, we use np.where(condition) for getting indexes, for instance in SelectorMixin.get_support(), in SimpleImputer.transform() and in several of our examples (example).

The numpy documentation discourages the use of np.where with just passing a condition and recommends np.asarray(condition).nonzero() instead.

For cleanliness of code, should we adopt this recommendation, at least in the examples? Or are there good reasons why we do that?

The text was updated successfully, but these errors were encountered:

ogrisel · 2024-12-03T13:57:39Z

The reason why np.where(condition) is discouraged is documented as follows:

Using nonzero directly should be preferred, as it behaves correctly for subclasses.

I don't recall any issue reported by users using scikit-learn with subclasses of NumPy arrays and having problems as a result. I am not even sure what "behaves correctly" means in that context: returning a NumPy array of integer indices or a subclass instance storing the same indices? For our use cases, I think returning a NumPy array (not subclassing) is perfectly fine, but any option might work.

Let's try:

>>> import numpy as np
>>> class my_array(np.ndarray):
...     pass
... 
>>> condition = np.asarray([True, False, False, True, True]).view(my_array)
>>> condition
my_array([ True, False, False,  True,  True])
>>> np.where(condition)[0]
array([0, 3, 4])
>>> condition.nonzero()[0]
array([0, 3, 4])
>>> np.nonzero(condition)[0]
array([0, 3, 4])

So I don't see any difference: neither returns an instance of the my_array subclass.

I traced this recommendation in the doc to this PR by @eric-wieser: numpy/numpy#11425

However, whenever we convert a scikit-learn estimator to add array API support, the xp.where prototype in the spec implies that the second and third argument are not optional (contrary to calling np.where with a single condition argument):

https://data-apis.org/array-api/latest/API_specification/generated/array_api.where.html#where

Since nonzero is also part of the array API, I think it would make sense to use xp.nonzero instead when we convert code to add array API support:

https://data-apis.org/array-api/latest/API_specification/generated/array_api.nonzero.html#nonzero

Apart from adding array API support, I would rather not change the code unless we identify a good reason to do it (in particular if we identify a user-facing impact). Personally, I am not convinced that one is particularly more readable than the other.

That being said, I would be curious to understand better why np.where(condition) is discouraged in the first place: it's not clear from the NumPy doc and my quick experiment above failed to clarify the issue.

lesteve · 2024-12-05T08:44:19Z

Maybe the note about the issue with subclasses is actually a reference to using np.where with a masked array: numpy/numpy#11425 (comment).

lorentzenchr · 2024-12-10T06:38:49Z

@eric-wieser @seberg friendly ping. We would like to better understand the numpy recommendation of using np.asarray(x).nonzero() over np.where(x).

Personally, I would follow numpy‘s recommendation without questioning it too much: their realm, their expertise.

seberg · 2024-12-10T09:00:01Z

where skips the .nonzero() attribute, so if a subclass defines that it is different (maybe only then). Of course these days the subclass could define __array_wrap__ so it is less of a thing maybe.
I never loved the overload of where to have two meanings, although I bet "where" sounds nicer to many than nonzero.

StefanieSenger added Documentation Needs Triage Issue requires triage labels Dec 3, 2024

StefanieSenger changed the title ~~Finding indexes with np.where(cond) or np.asarray(condition).nonzero()~~ Finding indexes with np.where(condition) or np.asarray(condition).nonzero() Dec 3, 2024

ogrisel removed the Needs Triage Issue requires triage label Dec 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finding indexes with `np.where(condition)` or `np.asarray(condition).nonzero()` #30400

Finding indexes with `np.where(condition)` or `np.asarray(condition).nonzero()` #30400

StefanieSenger commented Dec 3, 2024 •

edited

Loading

ogrisel commented Dec 3, 2024 •

edited

Loading

lesteve commented Dec 5, 2024

lorentzenchr commented Dec 10, 2024

seberg commented Dec 10, 2024

Finding indexes with np.where(condition) or np.asarray(condition).nonzero() #30400

Finding indexes with np.where(condition) or np.asarray(condition).nonzero() #30400

Comments

StefanieSenger commented Dec 3, 2024 • edited Loading

ogrisel commented Dec 3, 2024 • edited Loading

lesteve commented Dec 5, 2024

lorentzenchr commented Dec 10, 2024

seberg commented Dec 10, 2024

Finding indexes with `np.where(condition)` or `np.asarray(condition).nonzero()` #30400

Finding indexes with `np.where(condition)` or `np.asarray(condition).nonzero()` #30400

StefanieSenger commented Dec 3, 2024 •

edited

Loading

ogrisel commented Dec 3, 2024 •

edited

Loading