
ENH NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values #25330

Merged (25 commits) on Nov 5, 2024

Conversation

@vitaliset (Contributor) commented Jan 8, 2023

This PR fixes #25319.

As suggested by @glemaitre, I changed the X, y validation of ._fit, and then of .kneighbors and .radius_neighbors, when metric="nan_euclidean" for RadiusNeighborsMixin, KNeighborsMixin, and NeighborsBase. Consequently, this changes the behavior of their subclasses (KNeighborsTransformer, RadiusNeighborsTransformer, KNeighborsClassifier, RadiusNeighborsClassifier, LocalOutlierFactor, KNeighborsRegressor, RadiusNeighborsRegressor, NearestNeighbors).

I also updated the NearestCentroid class to follow this new validation. To make it work, I had to change the validation of sklearn.metrics.pairwise_distances_argmin and sklearn.metrics.pairwise_distances_argmin_min as well (updating the docs now that they support metric='nan_euclidean').
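For context, the underlying metric already handles missing values by ignoring absent coordinates and rescaling by the proportion of present ones, so an argmin over it is well defined. A minimal sketch (data invented for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Invented data: the query row has one missing coordinate.
X = np.array([[1.0, np.nan]])
Y = np.array([[1.0, 2.0], [10.0, 10.0]])

# nan_euclidean ignores the missing coordinate and rescales the sum of
# squared differences by n_features / n_present, so NaN does not
# propagate into the distances.
D = nan_euclidean_distances(X, Y)
closest = D.argmin(axis=1)  # what pairwise_distances_argmin computes
print(closest)  # [0]
```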

As KernelDensity uses kd_tree or ball_tree to build its index:

algorithm : {'kd_tree', 'ball_tree', 'auto'}, default='auto'

it does not support metric='nan_euclidean', and I made no changes to it.

from sklearn.neighbors import VALID_METRICS
for key in VALID_METRICS.keys():
    print(f"'nan_euclidean' in {key}:", 'nan_euclidean' in VALID_METRICS[key])
>>> 'nan_euclidean' in ball_tree: False
>>> 'nan_euclidean' in kd_tree: False
>>> 'nan_euclidean' in brute: True
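The same restriction can be seen directly at fit time; a small sketch (toy data, behavior as of recent scikit-learn versions):

```python
from sklearn.neighbors import NearestNeighbors

X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

# Brute force accepts the metric...
NearestNeighbors(algorithm="brute", metric="nan_euclidean").fit(X)

# ...while the tree-based indexes reject it when fitting.
try:
    NearestNeighbors(algorithm="kd_tree", metric="nan_euclidean").fit(X)
    kd_tree_raised = False
except ValueError as exc:
    kd_tree_raised = True
    print(f"kd_tree: {exc}")
```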

I also added a test, based on the code used to report the issue, that checks the behavior of the above classes.

@vitaliset vitaliset changed the title [WIP] FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values Jan 10, 2023
@glemaitre glemaitre self-requested a review January 14, 2023 13:14
@glemaitre (Member) left a comment
Thanks @vitaliset, a first pass on the PR.

Resolved review threads:
doc/whats_new/v1.3.rst
sklearn/metrics/tests/test_pairwise.py (two threads)
sklearn/neighbors/tests/test_neighbors.py


def test_nan_euclidean_support():
# Test input containing NaN.
@glemaitre (Member):
Suggested change
# Test input containing NaN.
"""Check that the different neighbor estimators are lenient towards `nan`
values if using `metric="nan_euclidean"`.
"""

@glemaitre (Member):
We should also make sure to check the output of predict. I would also add an additional test to check what happens if we have a full sample with nan values to see how it fails.

@vitaliset (Contributor, Author) commented Feb 4, 2023
Hello @glemaitre, thanks for the review! :D

Regarding your last comment, the all-nan input behaves like a constant X (the first indices are returned as nearest neighbors) and is independent of weights (I was expecting something different when weighting by distance).

import numpy as np
from sklearn import neighbors

X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='uniform')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

X = [[0, 0], [0, 0], [0, 0], [0, 0]]
model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

Results are similar for other classifiers/estimators such as RadiusNeighborsClassifier.
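My guess at why the first indices come out (an assumption about the internals, not something I verified in the scikit-learn code): with every distance equal to nan, a stable argsort simply preserves the original index order, so index 0 always looks like the nearest neighbor:

```python
import numpy as np

# All pairwise distances are nan, as in the all-NaN example above.
distances = np.array([np.nan, np.nan, np.nan, np.nan])

# NaNs compare as "largest" and a stable sort keeps ties (here:
# everything) in their original order.
order = np.argsort(distances, kind="stable")
print(order)  # [0 1 2 3]
```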

For KNeighborsTransformer and RadiusNeighborsTransformer we get:

model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=2)
model.fit_transform(X).toarray()
>>> array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

For LocalOutlierFactor we get:

model = neighbors.LocalOutlierFactor(metric="nan_euclidean", n_neighbors=1)
model.fit_predict(X)
>>> array([1, 1, 1, 1])

No errors are raised in any case. I think this makes sense for the estimators/neighbors search, but I'm unsure if we want this, especially for the transformers. What do you think?

On the other hand, the pairwise_distances family of functions gives us nan distances:

from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

pairwise_distances(X, X, metric="nan_euclidean")
>>> array([[nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])

pairwise_distances_argmin_min(X, X, metric="nan_euclidean")
>>> (array([0, 0, 0, 0], dtype=int64), array([nan, nan, nan, nan]))

Do you want me to do something about this?
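The nan distances follow from the metric's definition: when two rows share no jointly observed coordinate, there is nothing to average over, so the distance is undefined. A minimal sketch (toy data):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

all_nan = np.array([[np.nan, np.nan], [np.nan, np.nan]])
partial = np.array([[1.0, np.nan], [np.nan, 2.0]])

# No coordinate is present in both rows -> every distance is nan,
# including each row against itself.
D_all = nan_euclidean_distances(all_nan, all_nan)
print(D_all)

# The same happens with complementary missingness: the two rows never
# overlap on an observed feature.
D_disjoint = nan_euclidean_distances(partial[:1], partial[1:])
print(D_disjoint)
```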

@vitaliset (Contributor, Author):
Apart from possible new asserts/tests, like the one suggested here, I have made the revisions you requested, so I'm asking for a new review. :D Thanks for the first round of comments.

@vitaliset vitaliset requested a review from glemaitre February 4, 2023 19:49
@vitaliset vitaliset changed the title FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values ENH NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values Feb 17, 2023
@glemaitre (Member) left a comment
It looks good. I would still add a test for the full constant feature to ensure that we don't change the behaviour in the future.

Resolved review threads:
doc/whats_new/v1.3.rst
sklearn/metrics/tests/test_pairwise.py
sklearn/neighbors/tests/test_neighbors.py (two threads)
Comment on lines 2302 to 2305
# (neighbors.RadiusNeighborsRegressor, {}),
# (neighbors.RadiusNeighborsClassifier, {}),
(neighbors.NearestCentroid, {}),
# (neighbors.KNeighborsTransformer, {"n_neighbors": 2}),
@vitaliset (Contributor, Author):

Thanks for the new comments, @glemaitre. I tried to address them in my last commit.

Regarding the test for all-nan inputs, when I tested it the first time I must have done something wrong in the notebook, because I didn't realize that the behavior is a bit strange for some of the classes:

from sklearn import __version__
__version__
>>> '1.3.dev0'

import numpy as np
from sklearn import neighbors

X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

print(neighbors.RadiusNeighborsRegressor(metric='nan_euclidean').fit(X, y).predict(X))
>>> [-2147483648 -2147483648 -2147483648 -2147483648]
print(neighbors.RadiusNeighborsClassifier(metric="nan_euclidean").fit(X, y).predict(X))
>>> ValueError: No neighbors found for test samples array([0, 1, 2, 3], dtype=int64), you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.

for n in range(1, 5):
    model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=n)
    print(f"n_neighbors={n}")
    print(model.fit_transform(X).toarray())
>>> n_neighbors=1
>>> [[nan nan  0.  0.]
 [nan nan  0.  0.]
 [nan nan  0.  0.]
 [nan nan  0.  0.]]
>>> n_neighbors=2
>>> [[nan nan nan  0.]
 [nan nan nan  0.]
 [nan nan nan  0.]
 [nan nan nan  0.]]
>>> n_neighbors=3
>>> [[nan nan nan nan]
 [nan nan nan nan]
 [nan nan nan nan]
 [nan nan nan nan]]
>>> n_neighbors=4
>>> ValueError: Expected n_neighbors <= n_samples,  but n_samples = 4, n_neighbors = 5
# Note that n_samples = 4 and n_neighbors was set to 4, yet the error reports n_neighbors = 5

It seems that RadiusNeighborsRegressor and RadiusNeighborsClassifier cannot find any neighbor within the radius (and they behave differently from each other). As for KNeighborsTransformer, I don't know how to explain what's happening... 😵
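One detail I can explain (my reading, not confirmed against the implementation): -2147483648 is the 32-bit integer minimum, which is what numpy typically produces on x86 when a nan is cast to an integer dtype; the exact value is platform-dependent:

```python
import numpy as np

# With no neighbor inside the radius, the averaged prediction is nan;
# casting nan to a 32-bit integer is undefined behavior, and on x86
# it typically yields the smallest representable value.
pred = np.array([np.nan])
as_int = pred.astype(np.int32)  # e.g. [-2147483648] on x86
print(as_int)
```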

What should we do?

@glemaitre (Member):
Personally, I think it would be fine to raise an error in this case, but I don't know whether we can easily detect it.


github-actions bot commented May 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c69e867.

@glemaitre (Member):

I solved the conflicts. I'm not sure anymore that we should care about the constant case, because if everything in X is constant, this would be undefined behaviour.

So I think that just checking the way the pairwise distances work would be good enough.
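A minimal sketch (my wording, not the actual test added in this PR) of what checking the pairwise distances could look like:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Every pair of rows shares at least one observed coordinate, so all
# distances are defined.
X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, np.nan]])
D = pairwise_distances(X, metric="nan_euclidean")

assert np.isfinite(D).all()
assert np.allclose(D, D.T)           # symmetric
assert np.allclose(np.diag(D), 0.0)  # zero self-distance
```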

@glemaitre glemaitre added this to the 1.6 milestone May 21, 2024
@adrinjalali (Member):

@glemaitre I added the allow_nan tag here. So I'll let you review and merge.

@vitaliset (Contributor, Author):

Thanks @adrinjalali and @glemaitre for updating this PR with the latest main changes and for reviewing this PR. :)

@glemaitre (Member) commented Nov 5, 2024

Yep, it is cleaner using the tag in this manner.

@glemaitre glemaitre merged commit d2f1ea7 into scikit-learn:main Nov 5, 2024
30 checks passed

Successfully merging this pull request may close these issues.

KNeighborsRegressor with metric="nan_euclidean" does not actually support NaN values
3 participants