Skip to content

DOC: Add clarifications np.argpartition#26716

Merged
mattip merged 4 commits into
numpy:mainfrom
JuliaPoo:cont-15128-topk-feat
Jul 1, 2024
Merged

DOC: Add clarifications np.argpartition#26716
mattip merged 4 commits into
numpy:mainfrom
JuliaPoo:cont-15128-topk-feat

Conversation

@JuliaPoo

@JuliaPoo JuliaPoo commented Jun 17, 2024

Copy link
Copy Markdown
Contributor

Linking the PR for np.top_k #26666, since it is helpful for the docs for np.top_k to specify explicitly the treatment of np.nan and the stability of the returned indices, and the behaviour of np.top_k relies on the implementation of np.argpartition, I suggest adding similar clarifications to np.argpartition docs.

With regards to the treatment of np.nan, currently np.argpartition does not explicitly handle np.nan and by the nature of its implementation, np.nan is treated like np.inf. This seems like something that can change in the future so I wrote that the treatment of np.nan is undefined. If this is not desirable, perhaps one can consider adding a np.nanargpartition and np.nanpartition variant.

Making this change also allows np.top_k to justly say that it follows np.argpartition's specifications.

Comment thread numpy/_core/fromnumeric.py Outdated
is unstable, and hence the returned indices are not guaranteed
to be the earliest/latest occurrence of the element.

The treatment of ``np.nan`` in the input array is undefined.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to comment on the other issue also, but I don't think this is right. We sort them to the end and I am not aware of any plan to change that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? Doesn't look like it to me:

>>> x = np.array([1, 3, 2, np.nan, 4, 0, np.nan, 3.5])
>>> np.partition(x, 1)
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])
>>> x[np.argpartition(x, 1)]
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])

Without having looked at how it's implemented under the hood, it looks to me like there are no guarantees at least the unsorted elements to the right of the k-th element.

In addtion, this looks like a bug:

>>> np.partition(x, 5)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>>> np.partition(x, 6)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>>> np.partition(x, 7)  # Incorrect result when `kth == x.size-1`
array([1. , 3. , 2. , nan, 4. , 0. , 3.5, nan])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am sure as your last examples proof: The NaNs are all after or at the partition position.

And the last one is perfectly correct, since partition(x, 7) means that only the last entry (position 7) has any defined order at all.

@rgommers rgommers Jun 18, 2024

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we're talking past each other - "after the partition position" isn't the same as "we sort them to the end". The former looks correct indeed.

So the documented behavior should be something like "nan are treated as the largest possible values, larger than inf even - they always end up to the right of the kth partition element (unless there are more nan's in the array than fit to the right of kth)."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, "unspecified" would mean that they could occur completely randomly in the result.
The fact that they are random within each half is already clearly specified by partition itself! So yes, I think saying that their sort-order sorts them to the end is sufficient (which should be defined in sort, but maybe is not very clearly defined there).

But, if you prefer you can also add the longer explanation, it might just become incorrect conceptually when thinking about sorting in reverse order.
(I.e. I still think that the sort-order being defined to sort them to the end is the better specification, although maybe not the easier explanation)

@JuliaPoo

Copy link
Copy Markdown
Contributor Author

I've added a note about the sort order of np.nan (that its greater than np.inf). Sometimes np.partition will default to sorting which has sort order of np.nan defined so I was a bit suspicious it is indeed defined for np.partition, but from my own testing it does seem to be.

I've also added some tests to verify this sort order.

Comment thread numpy/_core/tests/test_multiarray.py Outdated
Comment thread numpy/_core/fromnumeric.py Outdated
Comment thread numpy/_core/fromnumeric.py Outdated

The sort order of ``np.nan`` is bigger than ``np.inf``.

See `partition` for notes on the different selection algorithms.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. Aren't we in the partition function docstring? Does this docstring show up in a different function as well?

@JuliaPoo JuliaPoo Jul 1, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort order is defined in the docstring for np.sort, but since np.partition uses a different implementation, I thought it would be helpful to also define the sort order of np.partition.

I've removed notes regarding the sort order from np.argpartition and instead linked it to the docs for np.partition in the latest commit, this follows what np.sort and np.argsort did.

@mattip mattip merged commit c21ac10 into numpy:main Jul 1, 2024
@mattip

mattip commented Jul 1, 2024

Copy link
Copy Markdown
Member

Thanks @JuliaPoo

@rgommers rgommers added this to the 2.1.0 release milestone Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants