Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC improve documentation of copy=False in preprocessing functions #27691

Merged
merged 11 commits into from
Dec 7, 2023

Conversation

konstantinos-p
Copy link
Contributor

Reference Issues/PRs

Example: Fixes #27307 (the issue was taken but had stalled)

What does this implement/fix? Explain your changes.

minmax_scale has an unexpected behavior when it is called with X of dtype other than float and copy=False. Specifically, even though copy=False, X is always first passed through the ~sklearn.utils.validation.check_array function. If the dtype of X is not float check_array casts it to float triggering a copy. Any subsequent changes on X happen on the copy and not in-place.

I propose to first change the docstring of minmax_scale to reflect this unexpected behaviour. I also propose a check based on two calls of check_array that outputs a warning when this behaviour is detected.

Any other comments?

Other functions in the _data.py file probably have the same behavior and need to be modified accordingly. I'd be happy to change these as well.

@konstantinos-p
Copy link
Contributor Author

@lesteve happy to have your thoughts!

@github-actions
Copy link

github-actions bot commented Oct 31, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8324307. Link to the linter CI: here

@lesteve
Copy link
Member

lesteve commented Oct 31, 2023

Thanks for the PR!

I think the easiest thing to do is to only change the docstring, I would put something close to the copy parameter docstring taking as an example the StandardScaler copy parameter doc:

copy : bool, default=True
If False, try to avoid a copy and do inplace scaling instead.
This is not guaranteed to always work inplace; e.g. if the data is
not a NumPy array or scipy.sparse CSR matrix, a copy may still be
returned.

I am not sure whether adding a warning for this case is worth it or not ... if we decide it is worth it then I guess we would need to add test for it.

@konstantinos-p
Copy link
Contributor Author

I basically copied the text from StandardScaler and added the case mentioned in Issue #27307. I've replicated the same error in

scale
maxabs_scale
robust_scale
normalize
binarize # doesn't copy on float but copies when the input is not an array
quantile_transform
power_transform

where the docstring for copy is also misleading. Once you are ok with the text I can add it to these functions as well, with some minor modifications.

@@ -649,6 +651,7 @@ def minmax_scale(X, feature_range=(0, 1), *, axis=0, copy=True):
X = check_array(
X, copy=False, ensure_2d=False, dtype=FLOAT_DTYPES, force_all_finite="allow-nan"
)

Copy link
Member

@lesteve lesteve Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the added newline here, in order to avoid adding unrelated changes?

@lesteve
Copy link
Member

lesteve commented Oct 31, 2023

Once you are ok with the text I can add it to these functions as well, with some minor modifications.

I think your changes to the docstring look fine. I would wait for a second opinion before updating it in other places.

@lesteve lesteve added the Quick Review For PRs that are quick to review label Nov 6, 2023
@lesteve
Copy link
Member

lesteve commented Nov 6, 2023

Can you put it out of draft mode?

The CI error is unrelated.

@betatim
Copy link
Member

betatim commented Nov 6, 2023

Some wording tweak suggestions, otherwise the wording looks good and 👍 for applying it to all the similar locations you found.

@konstantinos-p konstantinos-p marked this pull request as ready for review November 6, 2023 14:08
@geronimos
Copy link

I've seen the related issue today because I also got confused about the Transformer's behavior while setting copy=False. I much appreciate this PR.

I would like to suggest being more specific about the data types restriction by saying

NumPy array of dtype float

According to the note

e.g. if the data is not a NumPy array, is scipy.sparse CSR matrix, or is not of dtype float, a copy may still be returned.

I'd conclude that the data should either be a NumPy array or of dtype float. However, a NumPy array of dtype int is also copied. I assume the condition is NumPy array and dtype float.

I relate to @lesteve comment in the issue:

copy=False only works if the input array dtype is a float dtype, i.e. float64, float32 or float16 right now. I guess maybe the documentation could be improved to mention this?

In your case the input array dtype is an int dtype.

@konstantinos-p @lesteve Is naming the restriction to dtypes applicable in your opinion?

Set to False to perform inplace binarization and avoid a copy
(if the input is already a numpy array or a scipy.sparse CSR / CSC
matrix and if axis is 1).
If False, try to avoid a copy and scale in place.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel in this case (and maybe others) "scale" is not appropriate, so maybe stay closer to the original wording.

(if the input is already a numpy array or a scipy.sparse CSR / CSC
matrix and if axis is 1).
If False, try to avoid a copy and scale in place.
This is not guaranteed to always work in place; e.g. if the data is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case (and maybe others) the previous wording is accurate that scipy CSR or CSC matrices are not copied, so you can not reuse the same doc here.

In general, the fact that a copy is made mostly depends on the check_array arguments, but the code following it may also need to be looked at closer.

Copy link
Member

@lesteve lesteve Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think the previous docstring was accurate whether the new one is slightly misleading.

The code below is used, which means that it accepts CSR/CSC sparse matrix without copying it with copy=False:

X = check_array(X, accept_sparse=["csr", "csc"], copy=copy)

@konstantinos-p
Copy link
Contributor Author

konstantinos-p commented Nov 8, 2023

Hello! I've made two commits. In the first, I changed the language for the binarize, quantile_transform, power_transform functions from "scale" to something that more accurately reflects what each function is doing. In the second, I fixed the examples that I give for when a copy occurs. I've checked that they are accurate. binarize returns a copy when it is called on something that is not a NumPy array. All the other functions, as an example, return a copy "if the data is not a NumPy array,or is a scipy.sparse CSR matrix, or is not of dtype float".

@lesteve lesteve changed the title [WIP] minmax_scale has an unexpected behavior when it is called with X of dtype other than float and copy=False. DOC improve documentation of copy=False in preprocessing functions Nov 14, 2023
@lesteve
Copy link
Member

lesteve commented Nov 14, 2023

FYI I have changed the title of your PR to remove the WIP and make it slighly more compact. This will also affect the commit message when we merge this.

If False, try to avoid a copy and scale in place.
This is not guaranteed to always work in place; e.g. if the data is
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
float, a copy may still be returned.
Copy link
Member

@lesteve lesteve Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous docstring is more accurate (e.g. CSC matrix does not get copied with copy=False)

Edit: this is fine I read too quickly

If False, try to avoid a copy and scale in place.
This is not guaranteed to always work in place; e.g. if the data is
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
float, a copy may still be returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_array(..., accept_sparse=("csr", "csc"), dtype=FLOAT_DTYPES) means there is no copy if the matrix is a CSC or CSR matrix and the dtype is of kind float

If False, try to avoid a copy and scale in place.
This is not guaranteed to always work in place; e.g. if the data is
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
float, a copy may still be returned.
Copy link
Member

@lesteve lesteve Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I am started to think that aiming to be fully accurate in the docstring about the copy=False behaviour is not going to be workable. I would instead give 1 or 2 simple examples of case where data is copied even with copy=False.

For example in this case I would say something like "e.g. if the data is a Numpy array with a non-float dtype, a copy will be returned". Basically let's only look at the first check_array call and give an example where even with copy=False it will return a copy.

In this particular case the first check_array call accept csr/csc so no copy but then it depends on the RobustScaler, for example the CSR will be turned into CSC in RobustScaler.fit which may share some indices (so partial copy?) ...

Copy link
Contributor Author

@konstantinos-p konstantinos-p Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two ways of formulating the docstring. 1) We present the cases where copy = False works exhaustively 2) we present examples where copy=False fails. I completely agree that the second approach is more feasible.

As you mention, copies probably happen in different functions called by (in this case) maxabs_scale and simply looking at the check_array arguments is misleading. Is however the source of the copy relevant for the docstring?

I see two ways moving forward.

  1. I create some sort of test that confirms copies are made in case x. For example, I've created a colab where I already check that a copy is made in all the cases that I've listed (even if these don't happen because of check_array, I'm not tracking the source). I can adapt it based on a proposed template.

  2. I include in all dosctrings only the example data is not a NumPy array which triggers a copy through check_array in all the functions that are part of this PR. (This is basically also what you are proposing if I understand correctly)

I'd personally go for the second option (with maybe a more detailed entry for binarize)? Let me know what you think.

Copy link
Member

@lesteve lesteve Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like we are on the same line, i.e. giving one or two simple counter-examples, where copy=False still creates a copy so 2).

About ways for moving forward, I would say the testing approach seems too much work and not worth it.

So I would say:

  • use the simple example of the numpy array with a non-float dtype everywhere we can, i.e. when the first check_array call has dtype=FLOAT_DTYPES. This was the confusion in the original issue. More generally I am finding only now (sorry for realising this late) that "is not a Numpy array" is a bit unclear because a scipy sparse matrix is not a numpy array and yet in some cases scipy sparse matrix will not be copied.
  • treat the remaining cases with more care (I don't have a good view how many there are)

As you mention, copies probably happen in different functions called by (in this case) maxabs_scale and simply looking at the check_array arguments is misleading

Just for completeness, I was trying to say that looking at the first check_array is enough to find a counter-example where copy=False still creates a copy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I would use the wording and use "a copy will be returned" instead of "a copy may be returned" for the counter-example we give.

In some way it would be nice to find a way to convey that this is a example where copy=False still returns a copy but that there may be (plenty of) others.

If False, try to avoid a copy and normalize in place.
This is not guaranteed to always work in place; e.g. if the data is
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype
float, a copy may still be returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be tweaked

@lesteve lesteve enabled auto-merge (squash) December 7, 2023 12:45
@lesteve
Copy link
Member

lesteve commented Dec 7, 2023

LGTM, I set auto-merge on this PR

@lesteve lesteve merged commit 8be339f into scikit-learn:main Dec 7, 2023
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

I can not achieve inplace scaling by using sklearn.preprocessing.minmax_scale
4 participants