ENH Array API support for confusion_matrix #30440
Conversation
sklearn/metrics/_classification.py
Outdated
```diff
-    return np.zeros((n_labels, n_labels), dtype=int)
-elif len(np.intersect1d(y_true, labels)) == 0:
+    return xp.zeros((n_labels, n_labels), dtype=xp.int64, device=device_)
+elif not xp.isin(labels, y_true).any():
```
This line is currently only tested for ndarrays. xp.isin() does not exist in array_api_strict, and I am trying to find an alternative that also works under the strict definition.
Can you use sklearn.utils._array_api._isin? Looking at the isin status in the array API, I found data-apis/array-api#854, which mentions that scikit-learn has an implementation of it.
Oh yes, I can! ✨
Thanks for the suggestion.
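For context, a minimal sketch of how an isin-style helper can be written with only operations from the array API standard (no xp.isin), by broadcasting an equality comparison. The function name isin_portable is hypothetical, and numpy stands in for an arbitrary xp namespace; sklearn's private _isin helper may differ in signature and details:

```python
import numpy as np  # stands in for any array API namespace `xp`

def isin_portable(element, test_elements, xp):
    """Return a boolean array telling, for each entry of `element`, whether it
    occurs in `test_elements`, using only standard array API operations."""
    element = xp.reshape(element, (-1, 1))          # column vector
    test_elements = xp.reshape(test_elements, (1, -1))  # row vector
    # broadcasted comparison, then reduce over the test_elements axis
    return xp.any(element == test_elements, axis=-1)

labels = np.asarray([0, 1, 2])
y_true = np.asarray([5, 6])
# no overlap between labels and y_true -> the early-return branch would apply
print(not bool(isin_portable(labels, y_true, np).any()))  # → True
```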
I guess the fact that it was not failing with xp.isin makes me think that some additional tests are needed to exercise this part of the code with array API (non-NumPy) inputs.
I will add a test.
sklearn/metrics/_classification.py
Outdated
```diff
 if need_index_conversion:
     label_to_ind = {y: x for x, y in enumerate(labels)}
-    y_pred = np.array([label_to_ind.get(x, n_labels + 1) for x in y_pred])
-    y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
+    y_pred = xp.asarray(
+        [label_to_ind.get(x, n_labels + 1) for x in y_pred], device=device_
+    )
+    y_true = xp.asarray(
+        [label_to_ind.get(x, n_labels + 1) for x in y_true], device=device_
+    )
```
This code block within the if need_index_conversion condition is only tested for ndarrays, because of the way our tests are written. It should work for the other array libraries that we currently integrate, but I feel we should in fact test this part of the code.
It fails in label_to_ind = {y: x for x, y in enumerate(labels)} for array_api_strict, because these array elements are not hashable. I will try to fix this (possibly by refactoring, no spoilers) and add a test.
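To illustrate the hashability problem and one possible workaround: converting the labels to a plain Python list first makes the dictionary keys ordinary scalars. This is only a sketch; in scikit-learn the conversion would presumably go through _convert_to_numpy rather than the .tolist() call used here:

```python
import numpy as np

# array_api_strict array elements are not hashable, so iterating over the
# array directly to build the mapping fails. Building it from a plain Python
# list sidesteps the issue.
labels = np.asarray([2, 0, 1])
labels_list = labels.tolist()  # stand-in for _convert_to_numpy(labels, xp).tolist()
label_to_ind = {label: ind for ind, label in enumerate(labels_list)}
print(label_to_ind)  # → {2: 0, 0: 1, 1: 2}
```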
```python
for true, pred, weight in zip(y_true, y_pred, sample_weight):
    cm[true, pred] += weight
```
Is this performant enough? I think it could be, because we are mostly dealing with small matrices at this point. But finding another way might be better. I am not sure how to do this with the tools available in array_api_strict, though.
I have the feeling that this loop will kill any computational benefit of array API support. We might as well ensure that y_true and y_pred are numpy arrays using _convert_to_numpy and rely on the coo_matrix trick instead. This would keep the code simpler.
That being said, I think convenience array API support for classification metrics that rely on the confusion matrix internally is useful, as discussed in #30439 (comment).
I agree that Python loops do not go well with GPUs. However, there doesn't seem to be an alternative within the array API, because it doesn't support any sort of advanced indexing. So either we keep the loop, if we insist on following the array API, or we simply use the original code by converting with _convert_to_numpy as @ogrisel suggested.
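For reference, this is the coo_matrix trick the original NumPy-only code relies on: scipy sums duplicate (row, col) coordinates when materializing the sparse matrix, which accumulates the sample weights into the confusion matrix without a Python loop. A minimal sketch with made-up data:

```python
import numpy as np
from scipy.sparse import coo_matrix

y_true = np.asarray([0, 1, 2, 2])
y_pred = np.asarray([0, 2, 2, 2])
sample_weight = np.ones(4)
n_labels = 3

# Duplicate (true, pred) pairs are summed on construction, so this computes
# the weighted confusion matrix in one vectorized step.
cm = coo_matrix(
    (sample_weight, (y_true, y_pred)), shape=(n_labels, n_labels)
).toarray()
print(cm)  # → [[1. 0. 0.] [0. 0. 1.] [0. 0. 2.]]
```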
```python
else:
    cm = xp.zeros((n_labels, n_labels), dtype=dtype, device=device_)
    for true, pred, weight in zip(y_true, y_pred, sample_weight):
        cm[true, pred] += weight
```
```python
with np.errstate(all="ignore"):
```
I did some quick research into whether any of the other array libraries (other than numpy) would raise these warnings for division by zero as well. Result: it doesn't seem they do.
But just to be sure: is there any need to handle these warnings for any other array library?
I don't think there is any standardization of warnings and exceptions in the array API standard at this point unfortunately:
https://data-apis.org/array-api/latest/design_topics/exceptions.html
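To illustrate what the np.errstate block is guarding against: normalizing a confusion matrix can divide by zero when a class never occurs, and NumPy would emit RuntimeWarnings for that. A small sketch (the data and the row-normalization are illustrative, not the exact scikit-learn code):

```python
import numpy as np

cm = np.asarray([[2.0, 0.0], [0.0, 0.0]])  # second class never occurs in y_true
with np.errstate(all="ignore"):
    # 0/0 yields nan silently instead of emitting a RuntimeWarning
    normalized = cm / cm.sum(axis=1, keepdims=True)
print(normalized)  # → [[1. 0.] [nan nan]]
```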
Thanks for the PR @StefanieSenger
Thank you for reviewing, @OmarManzoor. I have implemented your suggestions. Would you mind having another look?
(I'm currently still working on it, though: I found a problem, so no rush.)
sklearn/utils/_array_api.py
Outdated
```python
def _nan_to_num(array, xp=None):
    """Substitutes NaN values with 0 and inf values with the maximum or minimum
    numbers available for the dtype respectively; like np.nan_to_num."""
    if xp is None:
```
We don't really need this check. get_namespace handles the case where we already have an xp defined and simply returns it.
```python
if xp is None:
```
sklearn/utils/_array_api.py
Outdated
```python
if xp is None:
    xp, _ = get_namespace(array, xp=xp)
```
Suggested change:
```diff
-if xp is None:
-    xp, _ = get_namespace(array, xp=xp)
+xp, _ = get_namespace(array, xp=xp)
```
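The reason the outer check is redundant can be sketched with a simplified model of get_namespace; this is an illustrative toy, not scikit-learn's actual implementation, which does more validation:

```python
import numpy as np

def get_namespace_sketch(*arrays, xp=None):
    """Simplified model of sklearn's get_namespace: if a namespace was
    already passed in, return it unchanged instead of re-detecting it."""
    if xp is not None:
        return xp, True
    # fallback detection (here everything is treated as NumPy)
    return np, True

xp, _ = get_namespace_sketch(np.asarray([1.0]))
xp2, _ = get_namespace_sketch(np.asarray([1.0]), xp=xp)
print(xp2 is xp)  # → True; wrapping the call in `if xp is None` adds nothing
```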
Reference Issues/PRs
towards #26024
What does this implement/fix? Explain your changes.
This PR aims to add array API support to confusion_matrix(). I have run the CUDA tests on Colab and they pass too.
@OmarManzoor @ogrisel @lesteve: do you want to have a look?