ENH NearestNeighbors-like classes with `metric="nan_euclidean"` do not actually support NaN values #25330
Thanks @vitaliset, a first pass on the PR.
```python
def test_nan_euclidean_support():
    # Test input containing NaN.
```

Suggested change:

```python
def test_nan_euclidean_support():
    """Check that the different neighbor estimators are lenient towards `nan`
    values if using `metric="nan_euclidean"`.
    """
```
We should also make sure to check the output of `predict`. I would also add an additional test to check what happens if we have a full sample with `nan` values, to see how it fails.
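For instance, a minimal sketch of such a test (the data and assertions are only illustrative, not the test that was finally merged):

```python
import numpy as np
from sklearn import neighbors

def test_predict_with_nan_rows():
    # Illustrative sketch: with metric="nan_euclidean", fit/predict should
    # accept NaN input and return one label per sample.
    X = [[np.nan, 1.0], [0.0, 1.0], [1.0, np.nan], [1.0, 2.0]]
    y = [0, 0, 1, 1]
    model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=2)
    pred = model.fit(X, y).predict(X)
    assert pred.shape == (4,)
```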
Hello @glemaitre, thanks for the review! :D
Regarding your last comment, the all-`nan` input leads to behavior similar to a constant `X` (the first indexes are returned as nearest neighbors) and is independent of `weights` (I was expecting something to change when we weighted by distance).
```python
X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='uniform')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

X = [[0, 0], [0, 0], [0, 0], [0, 0]]
model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])
```
Results are similar for other classifiers/estimators such as `RadiusNeighborsClassifier`.

For `KNeighborsTransformer` and `RadiusNeighborsTransformer` we get:
```python
model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=2)
model.fit_transform(X).toarray()
>>> array([[0., 0., 0., 0.],
           [0., 0., 0., 0.],
           [0., 0., 0., 0.],
           [0., 0., 0., 0.]])
```
For `LocalOutlierFactor` we get:
```python
model = neighbors.LocalOutlierFactor(metric="nan_euclidean", n_neighbors=1)
model.fit_predict(X)
>>> array([1, 1, 1, 1])
```
No errors are raised in any case. I think this makes sense for the estimators/neighbors search, but I'm unsure if we want this, especially in the transformers. What do you think?
On the other hand, the `pairwise_distances` family of functions gives us `nan` distances:
```python
pairwise_distances(X, X, metric="nan_euclidean")
>>> array([[nan, nan, nan, nan],
           [nan, nan, nan, nan],
           [nan, nan, nan, nan],
           [nan, nan, nan, nan]])

pairwise_distances_argmin_min(X, X, metric="nan_euclidean")
>>> (array([0, 0, 0, 0], dtype=int64), array([nan, nan, nan, nan]))
```
Do you want me to do something about this?
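For context, these `nan`s come straight from the underlying metric: `nan_euclidean_distances` rescales each distance using only the jointly observed coordinates, so when two samples share none, the result is `nan`. A small self-contained illustration (my own example, not code from the PR):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

partial = np.array([[np.nan, 1.0], [0.0, 1.0]])
all_missing = np.array([[np.nan, np.nan]])

# At least one coordinate is observed in both samples: defined distance.
print(nan_euclidean_distances(partial, partial))
# No jointly observed coordinate: the distance is nan.
print(nan_euclidean_distances(all_missing, partial))
```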
Apart from possible new asserts/tests, like the one suggested here, I made the revisions you suggested, so I'm requesting a new review. :D Thanks for the first round of comments.
It looks good. I would still add a test for the fully constant feature case to ensure that we don't change the behaviour in the future.
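Something along these lines, as a sketch (based on the constant-`X` output shown earlier in the thread; the merged test may differ):

```python
import numpy as np
from numpy.testing import assert_array_equal
from sklearn import neighbors

def test_nan_euclidean_constant_X():
    # Sketch: a fully constant X should keep returning the first indexes
    # as nearest neighbors, hence the label of sample 0 everywhere.
    X = np.zeros((4, 2))
    y = [1, 2, 3, 4]
    model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3)
    assert_array_equal(model.fit(X, y).predict(X), [1, 1, 1, 1])
```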
```python
# (neighbors.RadiusNeighborsRegressor, {}),
# (neighbors.RadiusNeighborsClassifier, {}),
(neighbors.NearestCentroid, {}),
# (neighbors.KNeighborsTransformer, {"n_neighbors": 2}),
```
Thanks for the new comments, @glemaitre. I tried to address them in my last commit.
Regarding the test for all-`nan` inputs, when I tested it the first time I must have done something wrong in the notebook, because I didn't realize that the behavior is a bit strange for some of the classes:
```python
from sklearn import __version__
__version__
>>> '1.3.dev0'

import numpy as np
from sklearn import neighbors

X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

print(neighbors.RadiusNeighborsRegressor(metric='nan_euclidean').fit(X, y).predict(X))
>>> [-2147483648 -2147483648 -2147483648 -2147483648]

print(neighbors.RadiusNeighborsClassifier(metric="nan_euclidean").fit(X, y).predict(X))
>>> ValueError: No neighbors found for test samples array([0, 1, 2, 3], dtype=int64), you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.
```
```python
for n in range(1, 5):
    model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=n)
    print(f"n_neighbors={n}")
    print(model.fit_transform(X).toarray())
>>> n_neighbors=1
>>> [[nan nan 0. 0.]
     [nan nan 0. 0.]
     [nan nan 0. 0.]
     [nan nan 0. 0.]]
>>> n_neighbors=2
>>> [[nan nan nan 0.]
     [nan nan nan 0.]
     [nan nan nan 0.]
     [nan nan nan 0.]]
>>> n_neighbors=3
>>> [[nan nan nan nan]
     [nan nan nan nan]
     [nan nan nan nan]
     [nan nan nan nan]]
>>> n_neighbors=4
>>> ValueError: Expected n_neighbors <= n_samples, but n_samples = 4, n_neighbors = 5
# Note that n_samples = 4 and n_neighbors should actually be = 4 as well
```
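If it helps, my reading (worth double-checking against the source) is that `KNeighborsTransformer` with its default `mode="distance"` computes one extra neighbor so that each sample can be considered its own neighbor, which would explain why `n_neighbors=4` ends up requesting 5. Continuing the snippet above:

```python
# Sketch: the transformer internally queries n_neighbors + 1 neighbors
# (each sample counts as its own neighbor in mode="distance"), so with
# n_samples = 4 the largest value that can work is n_neighbors = 3.
model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=3)
model.fit_transform(X)  # queries 3 + 1 = 4 neighbors internally
```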
It seems that `RadiusNeighborsRegressor` and `RadiusNeighborsClassifier` cannot find any neighbor with a distance smaller than the radius (and they behave differently from each other). As for `KNeighborsTransformer`, I don't know how to explain what's happening... 😵
What should we do?
Personally, I think that it would be fine to raise an error for this case, but I don't know if we can easily detect it.
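One way it could be detected, as a rough sketch (a hypothetical helper, not something in the PR):

```python
import numpy as np

def _contains_all_nan_rows(X):
    # Hypothetical helper: flag samples where every feature is missing,
    # since such rows have no defined nan_euclidean distance to anything.
    X = np.asarray(X, dtype=float)
    return bool(np.isnan(X).all(axis=1).any())
```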
I solved the conflicts. I'm not sure anymore that we should care about the constant case, because if everything is constant in `X`, this would be undefined behaviour. So I think that checking the way the pairwise distances work would be good enough.

@glemaitre I added the

Thanks @adrinjalali and @glemaitre for updating this PR with the latest main changes and for reviewing this PR. :)

Yep, it is cleaner using the tag in this manner.
This PR fixes #25319.

As suggested by @glemaitre, I changed the `X, y` validation of `._fit`, and then of `.kneighbors` and `.radius_neighbors` when `metric="nan_euclidean"`, in `RadiusNeighborsMixin`, `KNeighborsMixin`, and `NeighborsBase`. Consequently, this changes the behavior of the classes that inherit from them (`KNeighborsTransformer`, `RadiusNeighborsTransformer`, `KNeighborsClassifier`, `RadiusNeighborsClassifier`, `LocalOutlierFactor`, `KNeighborsRegressor`, `RadiusNeighborsRegressor`, `NearestNeighbors`).

I also updated the `NearestCentroid` class to follow this new validation. To make it work, I had to change the validation of `sklearn.metrics.pairwise_distances_argmin` and `sklearn.metrics.pairwise_distances_argmin_min` as well (updating the docs now that they support `metric='nan_euclidean'`).

As `KernelDensity` uses kd_tree or ball_tree to build its index (scikit-learn/sklearn/neighbors/_kde.py, line 48 in 98cf537), it does not support `metric='nan_euclidean'`, and I made no changes to it.

Also, I added a test with the code used to report the issue, checking the behavior of the above classes.
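A rough sketch of the validation idea described above (my paraphrase using the public `check_array` helper, not the merged diff):

```python
import numpy as np
from sklearn.utils import check_array

# Sketch: NaN is only let through when the metric is designed for it.
def validate_for_metric(X, metric):
    force_all_finite = "allow-nan" if metric == "nan_euclidean" else True
    return check_array(X, force_all_finite=force_all_finite)

X = np.array([[np.nan, 1.0], [0.0, 1.0]])
validate_for_metric(X, "nan_euclidean")  # passes
# validate_for_metric(X, "euclidean")    # would raise ValueError
```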