ENH Array API support for confusion_matrix #30440
Conversation
sklearn/metrics/_classification.py
Outdated
  if need_index_conversion:
      label_to_ind = {y: x for x, y in enumerate(labels)}
-     y_pred = np.array([label_to_ind.get(x, n_labels + 1) for x in y_pred])
-     y_true = np.array([label_to_ind.get(x, n_labels + 1) for x in y_true])
+     y_pred = xp.asarray(
+         [label_to_ind.get(x, n_labels + 1) for x in y_pred], device=device_
+     )
+     y_true = xp.asarray(
+         [label_to_ind.get(x, n_labels + 1) for x in y_true], device=device_
+     )
This code block within the if need_index_conversion condition is only tested for ndarrays, because of the way our tests are written. It should work for the other array libraries that we currently integrate, but I feel we should in fact test this part of the code?
It fails in label_to_ind = {y: x for x, y in enumerate(labels)} for array_api_strict, because these array elements are not hashable.
I will try to fix this (possibly by re-factoring, no spoilers) and add a test.
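For illustration, a minimal repro of the reported failure (a sketch reconstructed from the description above, not code from this PR):

import array_api_strict as xp

labels = xp.asarray([0, 1, 2])
# Building a dict keyed by the array's elements fails, because the 0-D
# array_api_strict elements are not hashable (as reported above):
label_to_ind = {y: x for x, y in enumerate(labels)}  # raises TypeError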
I have asked here and they explained how to deal with it. It is fixed and I have added a test.
(I have also tried to re-factor, but not successfully yet: what I tried is much slower than the status quo, so at least in this PR I won't pursue it further, since we have a working solution.)
Not necessarily for this PR, but maybe we could use unique_inverse. Something like this might work:
_, y_pred = xp.unique_inverse(y_pred)
Maybe a clearer example with array-api-strict:
In [1]: import array_api_strict
...: arr = array_api_strict.asarray([1, 3, 4, 1, 3, 5])
...: unique_values, labels = array_api_strict.unique_inverse(arr)
...: labels
...:
Out[1]: Array([0, 1, 2, 0, 1, 3], dtype=array_api_strict.int64)
Actually, looking a bit further, it even looks like we have sklearn.utils._encode._unique(return_inverse=True), which seems to have array API support already ...
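A rough sketch of what that could look like (hedged: this assumes _unique keeps its current return_inverse keyword, and it is not a drop-in replacement because the inverse indices would still need to be aligned with the shared labels array):

from sklearn.utils._encode import _unique

# Vectorized label-to-index encoding instead of a Python dict of array elements.
_, y_true_idx = _unique(y_true, return_inverse=True)
_, y_pred_idx = _unique(y_pred, return_inverse=True)
# Note: encoding y_true and y_pred separately gives inconsistent indices when
# their label sets differ, so the real code would have to encode both against
# the common labels array.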
Oh, that looks pretty good. I will try it tomorrow. :)
for true, pred, weight in zip(y_true, y_pred, sample_weight):
    cm[true, pred] += weight
Is this performant enough? I think it could be, because we are mostly dealing with small matrices at this point. But finding another way might be better. I am not sure how to do this with the tools available in array_api_strict though.
I have the feeling that this loop will kill any computational benefit of array API support. We might as well ensure that y_true and y_pred are numpy arrays using _convert_to_numpy and rely on the coo_matrix trick instead. This would keep the code simpler.
That being said, I think convenience array API support for classification metrics that rely on the confusion matrix internally is useful, as discussed in #30439 (comment).
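For concreteness, a hedged sketch of that option (variable names follow the surrounding confusion_matrix code; _convert_to_numpy is the existing helper in sklearn.utils._array_api):

from scipy.sparse import coo_matrix
from sklearn.utils._array_api import _convert_to_numpy

# Move the already index-encoded labels (and weights) back to NumPy on the
# host, then reuse the existing coo_matrix accumulation trick.
y_true_np = _convert_to_numpy(y_true, xp)
y_pred_np = _convert_to_numpy(y_pred, xp)
sample_weight_np = _convert_to_numpy(sample_weight, xp)
cm = coo_matrix(
    (sample_weight_np, (y_true_np, y_pred_np)),
    shape=(n_labels, n_labels),
    dtype=dtype,
).toarray()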
I agree that Python loops do not seem to go well with GPUs. However, there doesn't seem to be an alternative within the array API, because it doesn't support any sort of advanced indexing.
So either we use the loop, if we insist on following the array API, or we simply keep the original code by using _convert_to_numpy, as @ogrisel suggested.
I ran this to check both options:
import timeit

code_to_test = """
import torch
from sklearn.metrics import confusion_matrix
from sklearn.base import config_context

xp = torch
y_true = xp.randint(0, 2, (2**23,), dtype=xp.int32)
y_pred = xp.randint(0, 2, (2**23,), dtype=xp.int32)
with config_context(array_api_dispatch=True):
    confusion_matrix(y_true, y_pred)
"""

execution_time = timeit.timeit(code_to_test, number=1)
print(execution_time)
The results are pretty clear; I am really surprised by the difference:
with python loop: 115.27976658102125
with _convert_to_numpy: 2.0403239529696293
The code for the python loop is the current state of this PR.
The code for the _convert_to_numpy version is in an extra branch: here
So, I think this speaks for the _convert_to_numpy option?
Edit: This was tested on CPU; next I will test it in Colab to compare cpu-cpu conversion vs. gpu-cpu conversion.
else:
    cm = xp.zeros((n_labels, n_labels), dtype=dtype, device=device_)
    for true, pred, weight in zip(y_true, y_pred, sample_weight):
        cm[true, pred] += weight
with np.errstate(all="ignore"):
Here, I did some quick research into whether any of the other array libraries (other than numpy) would raise these warnings for division by zero as well. Result: it doesn't seem they do.
But just to be sure: is there any need to handle these warnings for any other array library?
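For context, this is the kind of NumPy warning being silenced here (a small standalone illustration, not code from this PR):

import numpy as np

cm = np.array([[0, 0], [3, 1]], dtype=np.float64)
with np.errstate(all="ignore"):
    # The all-zero row produces nan without emitting a RuntimeWarning.
    normalized = cm / cm.sum(axis=1, keepdims=True)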
I don't think there is any standardization of warnings and exceptions in the array API standard at this point unfortunately:
https://data-apis.org/array-api/latest/design_topics/exceptions.html
Okay, so running pytest sklearn/metrics/tests/test_common.py -W error::RuntimeWarning while commenting out the new test for confusion_matrix already raises a lot of other warnings from other tests.
Am I right to assume that we don't do anything about this at the moment?
Thanks for the PR @StefanieSenger
Thank you for reviewing, @OmarManzoor.
I have implemented your suggestions. Would you mind having another look?
(Currently still working on it: I found a problem, so no rush.) --> Edit: it's resolved.
sklearn/utils/_array_api.py
Outdated
if xp is None:
    xp, _ = get_namespace(array, xp=xp)
- if xp is None:
-     xp, _ = get_namespace(array, xp=xp)
+ xp, _ = get_namespace(array, xp=xp)
Yes, thank you @OmarManzoor and @ogrisel. I have deleted the condition.
But I'm wondering: why do we allow xp as an input to the helper functions in this file and then still get xp anew using get_namespace()? It seems redundant, and it could also hide subtle bugs: if the array's namespace has changed during the execution of the actual function (the metric in this case), we might think we are passing one namespace via xp=xp, while in fact the array's namespace gets re-evaluated and its true namespace is used instead. I hope I managed to express what I mean here.
Maybe we should either not allow the helper functions to take an xp argument, or not re-evaluate the namespace anew, just so that it's always definite and easily visible which namespace is used?
Reference Issues/PRs
towards #26024
What does this implement/fix? Explain your changes.
This PR aims to add Array API support to confusion_matrix(). I have run the CUDA tests on Colab and they pass too.
@OmarManzoor @ogrisel @lesteve: do you want to have a look?