DOC improve documentation of copy=False in preprocessing functions #27691

konstantinos-p · 2023-10-31T12:59:51Z

Reference Issues/PRs

Fixes #27307 (the issue was taken but had stalled)

What does this implement/fix? Explain your changes.

minmax_scale behaves unexpectedly when it is called with X of a dtype other than float and copy=False. Even though copy=False, X is always passed first through the sklearn.utils.validation.check_array function. If the dtype of X is not float, check_array casts it to float, which triggers a copy. Any subsequent changes to X then happen on the copy and not in place.
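A minimal sketch of the behaviour described above, assuming a recent scikit-learn and NumPy. The array values and variable names are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Integer input: check_array casts it to float, so the result is a new array
# and the original integer array is left untouched.
X_int = np.array([[1, 10], [2, 20], [3, 30]])
X_int_before = X_int.copy()
out_int = minmax_scale(X_int, copy=False)
int_was_copied = not np.shares_memory(X_int, out_int)
print("int input copied:", int_was_copied)
print("int input unchanged:", np.array_equal(X_int, X_int_before))

# Float input: no cast is needed, so the scaling can happen in place.
X_float = X_int.astype(np.float64)
out_float = minmax_scale(X_float, copy=False)
float_shares_memory = np.shares_memory(X_float, out_float)
print("float input scaled in place:", float_shares_memory)
```

np.shares_memory is used here only as a convenient way to tell whether the returned array is backed by the caller's buffer.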

I propose to first change the docstring of minmax_scale to reflect this unexpected behaviour. I also propose a check based on two calls of check_array that outputs a warning when this behaviour is detected.
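To make the idea of a warning concrete, here is a rough sketch. It is not the code in this PR: instead of two check_array calls it uses a single call plus np.shares_memory, and the helper name check_array_warn_on_copy and the warning text are hypothetical:

```python
import warnings

import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES


def check_array_warn_on_copy(X, copy):
    """Hypothetical helper: validate X and warn if copy=False was not honoured."""
    X_checked = check_array(X, copy=copy, ensure_2d=False, dtype=FLOAT_DTYPES)
    # The input is only reused if it was already an ndarray of a float dtype.
    in_place = isinstance(X, np.ndarray) and np.shares_memory(X, X_checked)
    if not copy and not in_place:
        warnings.warn(
            "copy=False was requested, but the input was copied during validation.",
            UserWarning,
        )
    return X_checked


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_array_warn_on_copy(np.array([1, 2, 3]), copy=False)        # int dtype
    check_array_warn_on_copy(np.array([1.0, 2.0, 3.0]), copy=False)  # float dtype

copy_warnings = [w for w in caught if "copy=False was requested" in str(w.message)]
print(len(copy_warnings))
```

Only the integer input should trigger the warning in this sketch, since the float array passes through validation without a cast.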

Any other comments?

Other functions in the _data.py file probably have the same behavior and need to be modified accordingly. I'd be happy to change these as well.

konstantinos-p · 2023-10-31T13:00:59Z

@lesteve happy to have your thoughts!

github-actions · 2023-10-31T13:01:30Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8324307. Link to the linter CI: here

lesteve · 2023-10-31T13:37:25Z

Thanks for the PR!

I think the easiest thing to do is to change only the docstring. I would put something close to the copy parameter docstring of StandardScaler, for example:

scikit-learn/sklearn/preprocessing/_data.py

Lines 710 to 714 in a5fed0d

    
    copy : bool, default=True
        If False, try to avoid a copy and do inplace scaling instead.
        This is not guaranteed to always work inplace; e.g. if the data is
        not a NumPy array or scipy.sparse CSR matrix, a copy may still be
        returned.

I am not sure whether adding a warning for this case is worth it or not ... if we decide it is worth it, then I guess we would need to add a test for it.

konstantinos-p · 2023-10-31T15:01:14Z

I basically copied the text from StandardScaler and added the case mentioned in Issue #27307. I've replicated the same error in

scale
maxabs_scale
robust_scale
normalize
binarize # doesn't copy on float but copies when the input is not an array
quantile_transform
power_transform

where the docstring for copy is also misleading. Once you are ok with the text I can add it to these functions as well, with some minor modifications.

lesteve · 2023-10-31T16:08:43Z

sklearn/preprocessing/_data.py

@@ -649,6 +651,7 @@ def minmax_scale(X, feature_range=(0, 1), *, axis=0, copy=True):
    X = check_array(
        X, copy=False, ensure_2d=False, dtype=FLOAT_DTYPES, force_all_finite="allow-nan"
    )
+


Can you remove the added newline here, in order to avoid adding unrelated changes?

lesteve · 2023-10-31T16:11:17Z

> Once you are ok with the text I can add it to these functions as well, with some minor modifications.

I think your changes to the docstring look fine. I would wait for a second opinion before updating it in other places.

lesteve · 2023-11-06T08:44:04Z

Can you put it out of draft mode?

The CI error is unrelated.

sklearn/preprocessing/_data.py

betatim · 2023-11-06T13:08:06Z

Some wording tweak suggestions, otherwise the wording looks good and 👍 for applying it to all the similar locations you found.

geronimos · 2023-11-07T11:09:15Z

I saw the related issue today because I was also confused by the transformer's behavior when setting copy=False. I very much appreciate this PR.

I would like to suggest being more specific about the data type restriction by saying

> NumPy array of dtype float

From the note

> e.g. if the data is not a NumPy array, is scipy.sparse CSR matrix, or is not of dtype float, a copy may still be returned.

I would conclude that it is enough for the data to be either a NumPy array or of dtype float. However, a NumPy array of dtype int is also copied. I assume the actual condition is NumPy array and dtype float.
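A quick way to check this assumption, sketched here with check_array called directly using the dtype=FLOAT_DTYPES argument that appears in the diff. The helper is_copied and the sample inputs are only for illustration:

```python
import numpy as np
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES


def is_copied(X):
    """Return True if validating X with copy=False still produced a new array."""
    checked = check_array(X, copy=False, dtype=FLOAT_DTYPES)
    return not (isinstance(X, np.ndarray) and np.shares_memory(X, checked))


copied = {
    # Float values, but not a NumPy array: converted, hence copied.
    "list of floats": is_copied([[0.5, 1.5], [2.5, 3.5]]),
    # NumPy array, but not a float dtype: cast to float, hence copied.
    "int64 array": is_copied(np.array([[1, 2], [3, 4]], dtype=np.int64)),
    # NumPy array and float dtype: passed through without a copy.
    "float32 array": is_copied(np.array([[1, 2], [3, 4]], dtype=np.float32)),
}
for name, was_copied in copied.items():
    print(f"{name}: copied={was_copied}")
```

Only the last input satisfies both conditions at once, which is the case where copy=False can take effect.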

I am referring to @lesteve's comment in the issue:

> copy=False only works if the input array dtype is a float dtype, i.e. float64, float32 or float16 right now. I guess maybe the documentation could be improved to mention this?
>
> In your case the input array dtype is an int dtype.

@konstantinos-p @lesteve Is naming the restriction to dtypes applicable in your opinion?

lesteve · 2023-11-07T16:57:16Z

sklearn/preprocessing/_data.py

-        Set to False to perform inplace binarization and avoid a copy
-        (if the input is already a numpy array or a scipy.sparse CSR / CSC
-        matrix and if axis is 1).
+        If False, try to avoid a copy and scale in place.


I feel in this case (and maybe others) "scale" is not appropriate, so maybe stay closer to the original wording.

lesteve · 2023-11-07T17:02:23Z

sklearn/preprocessing/_data.py

-        (if the input is already a numpy array or a scipy.sparse CSR / CSC
-        matrix and if axis is 1).
+        If False, try to avoid a copy and scale in place.
+        This is not guaranteed to always work in place; e.g. if the data is


I think in this case (and maybe others) the previous wording is accurate in saying that scipy CSR or CSC matrices are not copied, so you cannot reuse the same doc here.

In general, whether a copy is made mostly depends on the check_array arguments, but the code following it may also need to be looked at more closely.

I still think the previous docstring was accurate, whereas the new one is slightly misleading.

The code below is used, which means that it accepts a CSR/CSC sparse matrix without copying it when copy=False:

X = check_array(X, accept_sparse=["csr", "csc"], copy=copy)
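As a small illustration of that point, the following sketch calls binarize on a CSR matrix with copy=False and checks that the input itself was modified. It assumes SciPy's csr_matrix and the default threshold of 0.0; the data values are arbitrary:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import binarize

# All stored values are positive, so with the default threshold (0.0)
# every stored entry becomes 1 and nothing is eliminated.
X_csr = sparse.csr_matrix(np.array([[0.0, 0.4], [2.0, 0.0], [0.0, 3.0]]))
out = binarize(X_csr, copy=False)

csr_shares_memory = np.shares_memory(X_csr.data, out.data)
input_after = X_csr.toarray()
print("output backed by the input's data:", csr_shares_memory)
print(input_after)
```

If the input had been copied, X_csr would still hold 0.4, 2.0 and 3.0 after the call.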

konstantinos-p · 2023-11-08T18:16:59Z

Hello! I've made two commits. In the first, I changed the wording for the binarize, quantile_transform and power_transform functions from "scale" to something that more accurately reflects what each function does. In the second, I fixed the examples that I give for when a copy occurs, and I've checked that they are accurate. binarize returns a copy when it is called on something that is not a NumPy array. All the other functions return a copy, for example, "if the data is not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype float".

lesteve · 2023-11-14T08:57:30Z

FYI I have changed the title of your PR to remove the WIP and make it slightly more compact. This will also affect the commit message when we merge this.

lesteve · 2023-11-14T09:28:27Z

sklearn/preprocessing/_data.py

+        If False, try to avoid a copy and scale in place.
+        This is not guaranteed to always work in place; e.g. if the data is
+        not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
+        float, a copy may still be returned.


~~I think the previous docstring is more accurate (e.g. CSC matrix does not get copied with copy=False)~~

Edit: this is fine I read too quickly

lesteve · 2023-11-14T09:30:41Z

sklearn/preprocessing/_data.py

+        If False, try to avoid a copy and scale in place.
+        This is not guaranteed to always work in place; e.g. if the data is
+        not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
+        float, a copy may still be returned.


check_array(..., accept_sparse=("csr", "csc"), dtype=FLOAT_DTYPES) means there is no copy if the matrix is a CSC or CSR matrix and the dtype is of kind float
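A sketch to make this concrete, calling check_array directly with those arguments on a few sparse inputs. The input matrices and the dictionary name are illustrative only:

```python
import numpy as np
from scipy import sparse
from sklearn.utils import check_array
from sklearn.utils.validation import FLOAT_DTYPES

dense = np.array([[0.0, 1.5], [2.5, 0.0]])
inputs = {
    "csr float64": sparse.csr_matrix(dense),
    "csc float64": sparse.csc_matrix(dense),
    "csr int64": sparse.csr_matrix(dense.astype(np.int64)),  # wrong dtype
    "coo float64": sparse.coo_matrix(dense),                 # wrong format
}

# True means check_array handed back the very same object, i.e. no copy.
same_object = {
    name: check_array(
        X, accept_sparse=("csr", "csc"), dtype=FLOAT_DTYPES, copy=False
    )
    is X
    for name, X in inputs.items()
}
for name, unchanged in same_object.items():
    print(f"{name}: returned as-is={unchanged}")
```

The int64 and COO cases show that either a dtype cast or a format conversion is enough to get a new object back, even with copy=False.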

lesteve · 2023-11-14T09:52:27Z

sklearn/preprocessing/_data.py

+        If False, try to avoid a copy and scale in place.
+        This is not guaranteed to always work in place; e.g. if the data is
+        not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
+        float, a copy may still be returned.


Honestly, I am starting to think that aiming to be fully accurate in the docstring about the copy=False behaviour is not going to be workable. I would instead give one or two simple examples of cases where data is copied even with copy=False.

For example, in this case I would say something like "e.g. if the data is a NumPy array with a non-float dtype, a copy will be returned". Basically, let's only look at the first check_array call and give an example where it will return a copy even with copy=False.

In this particular case the first check_array call accepts csr/csc, so no copy, but then it depends on the RobustScaler; for example, the CSR will be turned into CSC in RobustScaler.fit, which may share some indices (so partial copy?) ...

konstantinos-p · 2023-11-14

There are two ways of formulating the docstring: 1) we present the cases where copy=False works exhaustively, or 2) we present examples where copy=False fails. I completely agree that the second approach is more feasible.

As you mention, copies probably happen in different functions called by (in this case) maxabs_scale and simply looking at the check_array arguments is misleading. Is however the source of the copy relevant for the docstring?

I see two ways of moving forward:

1. I create some sort of test that confirms copies are made in case x. For example, I've created a colab where I already check that a copy is made in all the cases that I've listed (even if these don't happen because of check_array, I'm not tracking the source). I can adapt it based on a proposed template.

2. I include in all docstrings only the example "data is not a NumPy array", which triggers a copy through check_array in all the functions that are part of this PR. (This is basically also what you are proposing, if I understand correctly.)

I'd personally go for the second option (with maybe a more detailed entry for binarize). Let me know what you think.
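For reference, the kind of check meant by the first option could look roughly like the sketch below. This is not the colab mentioned above, just an illustration with made-up data, restricted to the functions that cast integer input to float:

```python
import numpy as np
from sklearn.preprocessing import (
    maxabs_scale,
    minmax_scale,
    normalize,
    robust_scale,
    scale,
)

X_int = np.array([[1, 2], [3, 5], [6, 9]])

results = {}
for func in (scale, minmax_scale, maxabs_scale, robust_scale, normalize):
    X = X_int.copy()
    out = func(X, copy=False)
    # A copy was made if the output does not share memory with the input
    # and the input still holds its original values.
    results[func.__name__] = (not np.shares_memory(X, out)) and np.array_equal(
        X, X_int
    )

for name, copied in results.items():
    print(f"{name}: copied despite copy=False -> {copied}")
```

binarize is left out on purpose, since it does not cast integer arrays and therefore needs a different counter-example.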

lesteve · 2023-11-14

It feels like we are on the same page, i.e. giving one or two simple counter-examples where copy=False still creates a copy, so 2).

About ways for moving forward, I would say the testing approach seems too much work and not worth it.

So I would say:

- use the simple example of the NumPy array with a non-float dtype everywhere we can, i.e. when the first check_array call has dtype=FLOAT_DTYPES. This was the confusion in the original issue. More generally, I am realising only now (sorry for noticing this late) that "is not a NumPy array" is a bit unclear, because a scipy sparse matrix is not a NumPy array and yet in some cases a scipy sparse matrix will not be copied.

- treat the remaining cases with more care (I don't have a good view of how many there are)

> As you mention, copies probably happen in different functions called by (in this case) maxabs_scale and simply looking at the check_array arguments is misleading

Just for completeness, I was trying to say that looking at the first check_array is enough to find a counter-example where copy=False still creates a copy.

Also, I would use the wording "a copy will be returned" instead of "a copy may be returned" for the counter-example we give.

In some way it would be nice to convey that this is one example where copy=False still returns a copy, but that there may be (plenty of) others.

lesteve · 2023-11-14T09:55:51Z

sklearn/preprocessing/_data.py

+        If False, try to avoid a copy and normalize in place.
+        This is not guaranteed to always work in place; e.g. if the data is
+        not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
+        float, a copy may still be returned.


I think this needs to be tweaked

lesteve · 2023-12-07T12:54:33Z

LGTM, I set auto-merge on this PR

Change the docstring of minmax_scale + add warning when X dtype not f…

deca6b8

…loat and copy=False

github-actions bot added the module:preprocessing label Oct 31, 2023

Fix errors raised by black and ruff

4c2c4dc

Remove proposed warning, fix minmax_scale docstring according to Stan…

4bcc4f0

…dardScaler docstring

lesteve reviewed Oct 31, 2023

Remove spurious newline in minmax_scale

342401d

lesteve added the Quick Review For PRs that are quick to review label Nov 6, 2023

betatim reviewed Nov 6, 2023

konstantinos-p and others added 3 commits November 6, 2023 14:37

🎨 Some wording tweaks sklearn/preprocessing/_data.py

3b4ef64

Co-authored-by: Tim Head

Apply corrected text to scale, maxabs_scale, robust_scale, normalize,…

e928bdf

… binarize, quantile_transform, power_transform

Change inplace in the second sentence to in place to match the phrasi…

8776640

…ng of the first sentence

konstantinos-p marked this pull request as ready for review November 6, 2023 14:08

lesteve reviewed Nov 7, 2023

konstantinos-p added 2 commits November 8, 2023 18:42

Adapt the phrase scale, to more accurately reflect the transformation…

475acda

… performed by each function

Change the example when a copy occurs for the binarize function, fix …

5fd8c70

…other small typos

lesteve changed the title ~~[WIP] minmax_scale has an unexpected behavior when it is called with X of dtype other than float and copy=False.~~ DOC improve documentation of copy=False in preprocessing functions Nov 14, 2023

github-actions bot added the Documentation label Nov 14, 2023

lesteve reviewed Nov 14, 2023

konstantinos-p and others added 2 commits December 7, 2023 13:37

Streamline text across all functions with single counterexample using…

02afc9f

… wrong dtypes

lint

8324307

lesteve enabled auto-merge (squash) December 7, 2023 12:45

lesteve approved these changes Dec 7, 2023

lesteve merged commit 8be339f into scikit-learn:main Dec 7, 2023
27 checks passed

lesteve mentioned this pull request Apr 10, 2024

Unexpected behavior of sklearn.feature_selection.mutual_info_regression if copy=False #28793

Open
