
ENH Array API support for confusion_matrix #30440

Open
wants to merge 10 commits into main

Conversation

Contributor

@StefanieSenger StefanieSenger commented Dec 9, 2024

Reference Issues/PRs

towards #26024

What does this implement/fix? Explain your changes.

This PR aims to add Array API support to confusion_matrix(). I have run the CUDA tests on Colab and they pass too.
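For context, a minimal usage sketch of what this enables (assuming PyTorch and array-api-compat are installed; the pattern mirrors the benchmark further down this thread):

import torch
from sklearn import config_context
from sklearn.metrics import confusion_matrix

y_true = torch.tensor([0, 1, 1, 0, 1])
y_pred = torch.tensor([0, 1, 0, 0, 1])

# With array API dispatch enabled, confusion_matrix accepts torch tensors
# directly instead of requiring a manual conversion to NumPy first.
with config_context(array_api_dispatch=True):
    cm = confusion_matrix(y_true, y_pred)
print(cm)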

@OmarManzoor @ogrisel @lesteve: do you want to have a look?


github-actions bot commented Dec 9, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 49f75b7. Link to the linter CI: here

sklearn/metrics/_classification.py (outdated review thread, resolved)
Comment on lines 377 to 384
    if need_index_conversion:
        label_to_ind = {y: x for x, y in enumerate(labels)}
-       y_pred = np.array([label_to_ind.get(x, n_labels + 1) for x in y_pred])
-       y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
+       y_pred = xp.asarray(
+           [label_to_ind.get(x, n_labels + 1) for x in y_pred], device=device_
+       )
+       y_true = xp.asarray(
+           [label_to_ind.get(x, n_labels + 1) for x in y_true], device=device_
+       )
Contributor Author

@StefanieSenger StefanieSenger Dec 9, 2024

This code block within the if need_index_conversion condition is only tested for ndarrays, because of the way our tests are written. It should work for the other array libraries that we currently integrate, but shouldn't we in fact test this part of the code?

Contributor Author

@StefanieSenger StefanieSenger Dec 10, 2024

It fails in label_to_ind = {y: x for x, y in enumerate(labels)} for array_api_strict, because these array elements are not hashable.
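A minimal sketch of the failure mode (behavior as reported above; details depend on the array_api_strict version):

import array_api_strict as xp

labels = xp.asarray([0, 1, 2])

# Iterating over the 1-D array yields 0-d array_api_strict Arrays, and
# those elements are not hashable, so building the label -> index mapping
# fails with a TypeError for this namespace.
label_to_ind = {y: x for x, y in enumerate(labels)}  # raises TypeError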

I will try to fix this (possibly by refactoring, no spoilers) and add a test.

Contributor Author

@StefanieSenger StefanieSenger Dec 12, 2024

I have asked here and they explained to me how to deal with it. It is fixed and I have added a test.

(I have also tried to refactor, but without success so far: what I tried is much slower than the status quo. At least within this PR I won't pursue it further, since we have a working solution.)

Member

Not necessarily for this PR but maybe we could use unique_inverse

something like this would maybe work:

_, y_pred = xp.unique_inverse(y_pred)

Maybe a clearer example with array-api-strict:

In [1]: import array_api_strict
   ...: arr = array_api_strict.asarray([1, 3, 4, 1, 3, 5])
   ...: unique_values, labels = array_api_strict.unique_inverse(arr)
   ...: labels
   ...: 
Out[1]: Array([0, 1, 2, 0, 1, 3], dtype=array_api_strict.int64)

Member

Actually, looking a bit further, it even looks like we have sklearn.utils._encode._unique(return_inverse=True), which seems to have array API support already ...
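A rough sketch of what that could look like (hypothetical usage of the private helper, not what this PR currently does):

import numpy as np
from sklearn.utils._encode import _unique

y_pred = np.asarray([1, 3, 4, 1, 3, 5])

# With return_inverse=True, _unique also returns the index of each value
# within the sorted unique values, similar to xp.unique_inverse above.
unique_values, y_pred_idx = _unique(y_pred, return_inverse=True)
print(unique_values)  # [1 3 4 5]
print(y_pred_idx)     # [0 1 2 0 1 3]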

Contributor Author

Oh, that looks pretty good. I will try it tomorrow. :)

Comment on lines +408 to +409
for true, pred, weight in zip(y_true, y_pred, sample_weight):
    cm[true, pred] += weight
Contributor Author

@StefanieSenger StefanieSenger Dec 9, 2024

Is this performant enough? I think it could be, because we are mostly dealing with small matrices at this point. But finding another way might be better. I am not sure how to do this with the tools available in array_api_strict though.

Member

I have the feeling that this loop will kill any computational benefit of array API support. We might as well ensure that y_true and y_pred are numpy arrays using _convert_to_numpy and rely on the coo_matrix trick instead. This would keep the code simpler.

That being said, I think convenience array API support for classification metrics that rely on confusion matrix internally is useful as discussed in #30439 (comment).
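For reference, a sketch of the _convert_to_numpy + coo_matrix approach mentioned above (toy inputs already encoded as label indices; torch is just one example namespace, and the private helpers are used purely for illustration):

import numpy as np
import torch
from scipy.sparse import coo_matrix

from sklearn.utils._array_api import _convert_to_numpy

y_true = torch.tensor([0, 1, 1, 0])
y_pred = torch.tensor([0, 1, 0, 0])
sample_weight = np.ones(4, dtype=np.int64)
n_labels = 2

# Convert back to NumPy and let coo_matrix sum duplicate (true, pred)
# pairs, which is how the existing NumPy code path builds the matrix.
y_true_np = _convert_to_numpy(y_true, torch)
y_pred_np = _convert_to_numpy(y_pred, torch)
cm = coo_matrix(
    (sample_weight, (y_true_np, y_pred_np)),
    shape=(n_labels, n_labels),
    dtype=np.int64,
).toarray()
print(cm)  # [[2 0]
           #  [1 1]]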

Contributor

I agree that Python loops do not seem to go well with GPUs. However, there doesn't seem to be an alternative within the array API, because it doesn't support any sort of advanced indexing.

So we either have to use the loop, if we insist on staying within the array API, or we simply keep the original code by using _convert_to_numpy as @ogrisel suggested.

Contributor Author

@StefanieSenger StefanieSenger Dec 12, 2024

I ran this to check both options:

import timeit

code_to_test = """
import torch
from sklearn.metrics import confusion_matrix
from sklearn.base import config_context

xp = torch

y_true = xp.randint(0, 2, (2**23,), dtype=xp.int32)
y_pred = xp.randint(0, 2, (2**23,), dtype=xp.int32)
with config_context(array_api_dispatch=True):
    confusion_matrix(y_true, y_pred)
"""

execution_time = timeit.timeit(code_to_test, number=1)
print(execution_time)

The results are pretty clear; I am really surprised by the size of the difference:

with python loop:
115.27976658102125

with _convert_to_numpy:
2.0403239529696293

The code for the Python loop is the current state of this PR.
The code for the _convert_to_numpy version is in an extra branch: here

So, I think this speaks for the _convert_to_numpy option?

Edit: This was tested on CPU; next I will test it in Colab to compare CPU-to-CPU vs. GPU-to-CPU conversion.

else:
    cm = xp.zeros((n_labels, n_labels), dtype=dtype, device=device_)
    for true, pred, weight in zip(y_true, y_pred, sample_weight):
        cm[true, pred] += weight

with np.errstate(all="ignore"):
Contributor Author

Here I did some quick research into whether any of the other array libraries (besides numpy) would also raise these warnings for division by zero. Result: it doesn't seem that they do.

But just to be sure: is there any need to handle these warnings for any other array library?
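For reference, the kind of NumPy warning being silenced here (minimal illustration):

import numpy as np

with np.errstate(all="ignore"):
    # Without the errstate context these divisions would emit
    # "divide by zero" / "invalid value" RuntimeWarnings.
    ratio = np.asarray([1.0, 0.0]) / np.asarray([0.0, 0.0])
print(ratio)  # [inf nan]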

Member

I don't think there is any standardization of warnings and exceptions in the array API standard at this point unfortunately:

https://data-apis.org/array-api/latest/design_topics/exceptions.html

Contributor Author

Okay, so running pytest sklearn/metrics/tests/test_common.py -W error::RuntimeWarning with the new test for confusion_matrix commented out already raises a lot of other warnings from other tests.

Am I right to assume that we don't do anything about this at the moment?

@ogrisel ogrisel added the CUDA CI label Dec 9, 2024
@github-actions github-actions bot removed the CUDA CI label Dec 9, 2024
Contributor

@OmarManzoor OmarManzoor left a comment

Thanks for the PR @StefanieSenger

sklearn/metrics/_classification.py (outdated review thread, resolved)
sklearn/metrics/_classification.py (outdated review thread, resolved)
Contributor Author

@StefanieSenger StefanieSenger left a comment

Thank you for reviewing, @OmarManzoor.
I have implemented your suggestions. Would you mind having another look?

(Currently still working on it: I found a problem, so no rush.) --> Edit: it's resolved.

sklearn/metrics/_classification.py (outdated review thread, resolved)
sklearn/metrics/_classification.py (outdated review thread, resolved)
Comment on lines 1109 to 1110
if xp is None:
    xp, _ = get_namespace(array, xp=xp)
Member

Suggested change
-if xp is None:
-    xp, _ = get_namespace(array, xp=xp)
+xp, _ = get_namespace(array, xp=xp)

Contributor Author

Yes, thank you @OmarManzoor and @ogrisel. I have deleted the condition.

But I'm wondering: why do we allow xp as an input to the helper functions in this file and then still fetch xp anew using get_namespace()? It seems redundant, and it could also hide subtle bugs: if our array has changed its namespace during the execution of the actual function (the metric in this case), we might think we are passing one namespace via xp=xp while in fact the array's namespace gets re-evaluated and its true namespace is used. I hope I managed to express what I mean here.

Maybe we should either not allow passing xp into the helper functions, or not re-evaluate the namespace anew, just so that it is always definite and easily visible which namespace is used?
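For illustration, the pattern being questioned looks roughly like this (simplified, hypothetical helper; only get_namespace itself is real):

from sklearn.utils._array_api import get_namespace

def _helper(array, xp=None):
    # The helper accepts an optional xp, yet still calls get_namespace on
    # the array; which namespace effectively wins is the question above.
    xp, _ = get_namespace(array, xp=xp)
    ...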

4 participants