DOC: Add clarifications np.argpartition#26716
Conversation
| is unstable, and hence the returned indices are not guaranteed | ||
| to be the earliest/latest occurrence of the element. | ||
|
|
||
| The treatment of ``np.nan`` in the input array is undefined. |
There was a problem hiding this comment.
Going to comment on the other issue also, but I don't think this is right. We sort them to the end and I am not aware of any plan to change that.
There was a problem hiding this comment.
Are you sure? Doesn't look like it to me:
>>> x = np.array([1, 3, 2, np.nan, 4, 0, np.nan, 3.5])
>>> np.partition(x, 1)
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])
>>> x[np.argpartition(x, 1)]
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])Without having looked at how it's implemented under the hood, it looks to me like there are no guarantees at least the unsorted elements to the right of the k-th element.
In addtion, this looks like a bug:
>>> np.partition(x, 5)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>>> np.partition(x, 6)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>>> np.partition(x, 7) # Incorrect result when `kth == x.size-1`
array([1. , 3. , 2. , nan, 4. , 0. , 3.5, nan])There was a problem hiding this comment.
Yes, I am sure as your last examples proof: The NaNs are all after or at the partition position.
And the last one is perfectly correct, since partition(x, 7) means that only the last entry (position 7) has any defined order at all.
There was a problem hiding this comment.
Maybe we're talking past each other - "after the partition position" isn't the same as "we sort them to the end". The former looks correct indeed.
So the documented behavior should be something like "nan are treated as the largest possible values, larger than inf even - they always end up to the right of the kth partition element (unless there are more nan's in the array than fit to the right of kth)."
There was a problem hiding this comment.
Well, "unspecified" would mean that they could occur completely randomly in the result.
The fact that they are random within each half is already clearly specified by partition itself! So yes, I think saying that their sort-order sorts them to the end is sufficient (which should be defined in sort, but maybe is not very clearly defined there).
But, if you prefer you can also add the longer explanation, it might just become incorrect conceptually when thinking about sorting in reverse order.
(I.e. I still think that the sort-order being defined to sort them to the end is the better specification, although maybe not the easier explanation)
|
I've added a note about the sort order of I've also added some tests to verify this sort order. |
|
|
||
| The sort order of ``np.nan`` is bigger than ``np.inf``. | ||
|
|
||
| See `partition` for notes on the different selection algorithms. |
There was a problem hiding this comment.
I don't understand this comment. Aren't we in the partition function docstring? Does this docstring show up in a different function as well?
There was a problem hiding this comment.
The sort order is defined in the docstring for np.sort, but since np.partition uses a different implementation, I thought it would be helpful to also define the sort order of np.partition.
I've removed notes regarding the sort order from np.argpartition and instead linked it to the docs for np.partition in the latest commit, this follows what np.sort and np.argsort did.
Co-authored-by: Matti Picus <[email protected]>
|
Thanks @JuliaPoo |
Linking the PR for
np.top_k#26666, since it is helpful for the docs fornp.top_kto specify explicitly the treatment ofnp.nanand the stability of the returned indices, and the behaviour ofnp.top_krelies on the implementation ofnp.argpartition, I suggest adding similar clarifications tonp.argpartitiondocs.With regards to the treatment of
np.nan, currentlynp.argpartitiondoes not explicitly handlenp.nanand by the nature of its implementation,np.nanis treated likenp.inf. This seems like something that can change in the future so I wrote that the treatment ofnp.nanis undefined. If this is not desirable, perhaps one can consider adding anp.nanargpartitionandnp.nanpartitionvariant.Making this change also allows
np.top_kto justly say that it followsnp.argpartition's specifications.