
ENH NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values #25330

Merged (25 commits) on Nov 5, 2024

Conversation

@vitaliset (Contributor) commented Jan 8, 2023

This PR fixes #25319.

As suggested by @glemaitre, I changed the X, y validation of ._fit, and then of .kneighbors and .radius_neighbors, when metric="nan_euclidean" for RadiusNeighborsMixin, KNeighborsMixin, and NeighborsBase. Consequently, this changes the behavior of their subclasses (KNeighborsTransformer, RadiusNeighborsTransformer, KNeighborsClassifier, RadiusNeighborsClassifier, LocalOutlierFactor, KNeighborsRegressor, RadiusNeighborsRegressor, NearestNeighbors).

I also updated the NearestCentroid class to follow this new validation. To make it work, I had to change the validation of sklearn.metrics.pairwise_distances_argmin and sklearn.metrics.pairwise_distances_argmin_min as well (updating the docs now that they support metric='nan_euclidean').
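For context, the underlying metric already handles missing values by ignoring absent coordinates and rescaling by the proportion of present ones, so an argmin over it is well defined. A minimal sketch (data invented for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Invented data: the query row has one missing coordinate.
X = np.array([[1.0, np.nan]])
Y = np.array([[1.0, 2.0], [10.0, 10.0]])

# nan_euclidean ignores the missing coordinate and rescales the sum of
# squared differences by n_features / n_present, so NaN does not
# propagate into the distances.
D = nan_euclidean_distances(X, Y)
closest = D.argmin(axis=1)  # what pairwise_distances_argmin computes
print(closest)  # [0]
```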

As KernelDensity uses kd_tree or ball_tree to build its index:

algorithm : {'kd_tree', 'ball_tree', 'auto'}, default='auto'

it does not support metric='nan_euclidean', and I made no changes to it.

from sklearn.neighbors import VALID_METRICS
for key in VALID_METRICS.keys():
    print(f"'nan_euclidean' in {key}:", 'nan_euclidean' in VALID_METRICS[key])
>>> 'nan_euclidean' in ball_tree: False
>>> 'nan_euclidean' in kd_tree: False
>>> 'nan_euclidean' in brute: True
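The same restriction can be seen directly at fit time; a small sketch (toy data, behavior as of recent scikit-learn versions):

```python
from sklearn.neighbors import NearestNeighbors

X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

# Brute force accepts the metric...
NearestNeighbors(algorithm="brute", metric="nan_euclidean").fit(X)

# ...while the tree-based indexes reject it when fitting.
try:
    NearestNeighbors(algorithm="kd_tree", metric="nan_euclidean").fit(X)
    kd_tree_raised = False
except ValueError as exc:
    kd_tree_raised = True
    print(f"kd_tree: {exc}")
```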

I also added a test, based on the code used to report the issue, that checks the behavior of the above classes.

@vitaliset vitaliset changed the title [WIP] FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values Jan 10, 2023
@glemaitre glemaitre self-requested a review January 14, 2023 13:14
@glemaitre (Member) left a comment
Thanks @vitaliset, a first pass on the PR.

Resolved review threads:
doc/whats_new/v1.3.rst
sklearn/metrics/tests/test_pairwise.py (two threads)
sklearn/neighbors/tests/test_neighbors.py


def test_nan_euclidean_support():
# Test input containing NaN.
@glemaitre (Member):
Suggested change
# Test input containing NaN.
"""Check that the different neighbor estimators are lenient towards `nan`
values if using `metric="nan_euclidean"`.
"""

@glemaitre (Member):
We should also make sure to check the output of predict. I would also add an additional test to check what happens if we have a full sample with nan values to see how it fails.

@vitaliset (Contributor, Author) commented Feb 4, 2023
Hello @glemaitre, thanks for the review! :D

Regarding your last comment, the all-nan input behaves like a constant X (the first indices are returned as nearest neighbors) and is independent of weights (I was expecting something different when weighting by distance).

import numpy as np
from sklearn import neighbors

X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='uniform')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

X = [[0, 0], [0, 0], [0, 0], [0, 0]]
model = neighbors.KNeighborsClassifier(metric="nan_euclidean", n_neighbors=3, weights='distance')
model.fit(X, y).predict(X)
>>> array([1, 1, 1, 1])

Results are similar for other classifiers/estimators such as RadiusNeighborsClassifier.
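My guess at why the first indices come out (an assumption about the internals, not something I verified in the scikit-learn code): with every distance equal to nan, a stable argsort simply preserves the original index order, so index 0 always looks like the nearest neighbor:

```python
import numpy as np

# All pairwise distances are nan, as in the all-NaN example above.
distances = np.array([np.nan, np.nan, np.nan, np.nan])

# NaNs compare as "largest" and a stable sort keeps ties (here:
# everything) in their original order.
order = np.argsort(distances, kind="stable")
print(order)  # [0 1 2 3]
```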

For KNeighborsTransformer and RadiusNeighborsTransformer we get:

model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=2)
model.fit_transform(X).toarray()
>>> array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

For LocalOutlierFactor we get:

model = neighbors.LocalOutlierFactor(metric="nan_euclidean", n_neighbors=1)
model.fit_predict(X)
>>> array([1, 1, 1, 1])

No errors are raised in any case. I think this makes sense for the estimators/neighbors search, but I'm unsure if we want this, especially for the transformers. What do you think?

On the other hand, the pairwise_distances family of functions gives us nan distances:

from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

pairwise_distances(X, X, metric="nan_euclidean")
>>> array([[nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan],
       [nan, nan, nan, nan]])

pairwise_distances_argmin_min(X, X, metric="nan_euclidean")
>>> (array([0, 0, 0, 0], dtype=int64), array([nan, nan, nan, nan]))

Do you want me to do something about this?
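The nan distances follow from the metric's definition: when two rows share no jointly observed coordinate, there is nothing to average over, so the distance is undefined. A minimal sketch (toy data):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

all_nan = np.array([[np.nan, np.nan], [np.nan, np.nan]])
partial = np.array([[1.0, np.nan], [np.nan, 2.0]])

# No coordinate is present in both rows -> every distance is nan,
# including each row against itself.
D_all = nan_euclidean_distances(all_nan, all_nan)
print(D_all)

# The same happens with complementary missingness: the two rows never
# overlap on an observed feature.
D_disjoint = nan_euclidean_distances(partial[:1], partial[1:])
print(D_disjoint)
```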

@vitaliset (Contributor, Author):
Apart from possible new asserts/tests, like the one suggested here, I have made the revisions you requested, so I'm asking for a new review. :D Thanks for the first round of comments.

@vitaliset vitaliset requested a review from glemaitre February 4, 2023 19:49
@vitaliset vitaliset changed the title FIX NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values ENH NearestNeighbors-like classes with metric="nan_euclidean" does not actually support NaN values Feb 17, 2023
@glemaitre (Member) left a comment
It looks good. I would still add a test for the full constant feature to ensure that we don't change the behaviour in the future.

Resolved review threads:
doc/whats_new/v1.3.rst
sklearn/metrics/tests/test_pairwise.py
sklearn/neighbors/tests/test_neighbors.py (two threads)
Comment on lines 2302 to 2305
# (neighbors.RadiusNeighborsRegressor, {}),
# (neighbors.RadiusNeighborsClassifier, {}),
(neighbors.NearestCentroid, {}),
# (neighbors.KNeighborsTransformer, {"n_neighbors": 2}),
@vitaliset (Contributor, Author):

Thanks for the new comments, @glemaitre. I tried to address them in my last commit.

Regarding the test for all-nan inputs, when I tested it the first time I must have done something wrong in the notebook, because I didn't realize that the behavior is a bit strange for some of the classes:

from sklearn import __version__
__version__
>>> '1.3.dev0'

import numpy as np
from sklearn import neighbors

X = [[np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan], [np.nan, np.nan]]
y = [1, 2, 3, 4]

print(neighbors.RadiusNeighborsRegressor(metric='nan_euclidean').fit(X, y).predict(X))
>>> [-2147483648 -2147483648 -2147483648 -2147483648]
print(neighbors.RadiusNeighborsClassifier(metric="nan_euclidean").fit(X, y).predict(X))
>>> ValueError: No neighbors found for test samples array([0, 1, 2, 3], dtype=int64), you can try using larger radius, giving a label for outliers, or considering removing them from your dataset.

for n in range(1, 5):
    model = neighbors.KNeighborsTransformer(metric="nan_euclidean", n_neighbors=n)
    print(f"n_neighbors={n}")
    print(model.fit_transform(X).toarray())
>>> n_neighbors=1
>>> [[nan nan  0.  0.]
 [nan nan  0.  0.]
 [nan nan  0.  0.]
 [nan nan  0.  0.]]
>>> n_neighbors=2
>>> [[nan nan nan  0.]
 [nan nan nan  0.]
 [nan nan nan  0.]
 [nan nan nan  0.]]
>>> n_neighbors=3
>>> [[nan nan nan nan]
 [nan nan nan nan]
 [nan nan nan nan]
 [nan nan nan nan]]
>>> n_neighbors=4
>>> ValueError: Expected n_neighbors <= n_samples,  but n_samples = 4, n_neighbors = 5
# Note that n_samples = 4 and n_neighbors was set to 4, yet the error reports n_neighbors = 5

It seems that RadiusNeighborsRegressor and RadiusNeighborsClassifier cannot find any neighbor within the radius (and they behave differently from each other). As for KNeighborsTransformer, I don't know how to explain what's happening... 😵
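One detail I can explain (my reading, not confirmed against the implementation): -2147483648 is the 32-bit integer minimum, which is what numpy typically produces on x86 when a nan is cast to an integer dtype; the exact value is platform-dependent:

```python
import numpy as np

# With no neighbor inside the radius, the averaged prediction is nan;
# casting nan to a 32-bit integer is undefined behavior, and on x86
# it typically yields the smallest representable value.
pred = np.array([np.nan])
as_int = pred.astype(np.int32)  # e.g. [-2147483648] on x86
print(as_int)
```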

What should we do?

@glemaitre (Member):
Personally, I think it would be fine to raise an error in this case, but I don't know whether we can easily detect it.


github-actions bot commented May 21, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: c69e867.

@glemaitre (Member):

I solved the conflicts. I'm not sure anymore that we should care about the constant case, because if everything in X is constant, this would be undefined behaviour.

So I think that just checking the way the pairwise distances work would be good enough.
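A minimal sketch (my wording, not the actual test added in this PR) of what checking the pairwise distances could look like:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Every pair of rows shares at least one observed coordinate, so all
# distances are defined.
X = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, np.nan]])
D = pairwise_distances(X, metric="nan_euclidean")

assert np.isfinite(D).all()
assert np.allclose(D, D.T)           # symmetric
assert np.allclose(np.diag(D), 0.0)  # zero self-distance
```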

@glemaitre glemaitre added this to the 1.6 milestone May 21, 2024
@adrinjalali (Member):

@glemaitre I added the allow_nan tag here. So I'll let you review and merge.

@vitaliset (Contributor, Author):

Thanks @adrinjalali and @glemaitre for updating this PR with the latest main changes and for reviewing this PR. :)

@glemaitre (Member) commented Nov 5, 2024

Yep, it is cleaner using the tag in this manner.

@glemaitre glemaitre merged commit d2f1ea7 into scikit-learn:main Nov 5, 2024
30 checks passed

Successfully merging this pull request may close these issues.

KNeighborsRegressor with metric="nan_euclidean" does not actually support NaN values
3 participants