-
-
Notifications
You must be signed in to change notification settings - Fork 25.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC improve documentation of copy=False in preprocessing functions #27691
DOC improve documentation of copy=False in preprocessing functions #27691
Conversation
…loat and copy=False
@lesteve happy to have your thoughts! |
Thanks for the PR! I think the easiest thing to do is to only change the docstring, I would put something close to the scikit-learn/sklearn/preprocessing/_data.py Lines 710 to 714 in a5fed0d
I am not sure whether adding a warning for this case is worth it or not ... if we decide it is worth it then I guess we would need to add test for it. |
…dardScaler docstring
I basically copied the text from StandardScaler and added the case mentioned in Issue #27307. I've replicated the same error in
where the docstring for |
sklearn/preprocessing/_data.py
Outdated
@@ -649,6 +651,7 @@ def minmax_scale(X, feature_range=(0, 1), *, axis=0, copy=True): | |||
X = check_array( | |||
X, copy=False, ensure_2d=False, dtype=FLOAT_DTYPES, force_all_finite="allow-nan" | |||
) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove the added newline here, in order to avoid adding unrelated changes?
I think your changes to the docstring look fine. I would wait for a second opinion before updating it in other places. |
Can you put it out of draft mode? The CI error is unrelated. |
Some wording tweak suggestions, otherwise the wording looks good and 👍 for applying it to all the similar locations you found. |
Co-authored-by: Tim Head <[email protected]>
… binarize, quantile_transform, power_transform
…ng of the first sentence
I've seen the related issue today because I also got confused about the Transformer's behavior while setting I would like to suggest being more specific about the data types restriction by saying
According to the note
I'd conclude that the data should either be a NumPy array or of dtype float. However, a NumPy array of dtype I relate to @lesteve comment in the issue:
@konstantinos-p @lesteve Is naming the restriction to dtypes applicable in your opinion? |
sklearn/preprocessing/_data.py
Outdated
Set to False to perform inplace binarization and avoid a copy | ||
(if the input is already a numpy array or a scipy.sparse CSR / CSC | ||
matrix and if axis is 1). | ||
If False, try to avoid a copy and scale in place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel in this case (and maybe others) "scale" is not appropriate, so maybe stay closer to the original wording.
(if the input is already a numpy array or a scipy.sparse CSR / CSC | ||
matrix and if axis is 1). | ||
If False, try to avoid a copy and scale in place. | ||
This is not guaranteed to always work in place; e.g. if the data is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case (and maybe others) the previous wording is accurate that scipy CSR or CSC matrices are not copied, so you can not reuse the same doc here.
In general, the fact that a copy is made mostly depends on the check_array
arguments, but the code following it may also need to be looked at closer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think the previous docstring was accurate whether the new one is slightly misleading.
The code below is used, which means that it accepts CSR/CSC sparse matrix without copying it with copy=False
:
X = check_array(X, accept_sparse=["csr", "csc"], copy=copy)
… performed by each function
…other small typos
Hello! I've made two commits. In the first, I changed the language for the binarize, quantile_transform, power_transform functions from "scale" to something that more accurately reflects what each function is doing. In the second, I fixed the examples that I give for when a copy occurs. I've checked that they are accurate. binarize returns a copy when it is called on something that is not a NumPy array. All the other functions, as an example, return a copy "if the data is not a NumPy array,or is a scipy.sparse CSR matrix, or is not of dtype float". |
X of dtype other than float
and copy=False
.
FYI I have changed the title of your PR to remove the WIP and make it slighly more compact. This will also affect the commit message when we merge this. |
sklearn/preprocessing/_data.py
Outdated
If False, try to avoid a copy and scale in place. | ||
This is not guaranteed to always work in place; e.g. if the data is | ||
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype | ||
float, a copy may still be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the previous docstring is more accurate (e.g. CSC matrix does not get copied with copy=False)
Edit: this is fine I read too quickly
sklearn/preprocessing/_data.py
Outdated
If False, try to avoid a copy and scale in place. | ||
This is not guaranteed to always work in place; e.g. if the data is | ||
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype | ||
float, a copy may still be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_array(..., accept_sparse=("csr", "csc"), dtype=FLOAT_DTYPES)
means there is no copy if the matrix is a CSC or CSR matrix and the dtype is of kind float
sklearn/preprocessing/_data.py
Outdated
If False, try to avoid a copy and scale in place. | ||
This is not guaranteed to always work in place; e.g. if the data is | ||
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype | ||
float, a copy may still be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I am started to think that aiming to be fully accurate in the docstring about the copy=False behaviour is not going to be workable. I would instead give 1 or 2 simple examples of case where data is copied even with copy=False
.
For example in this case I would say something like "e.g. if the data is a Numpy array with a non-float dtype, a copy will be returned". Basically let's only look at the first check_array call and give an example where even with copy=False
it will return a copy.
In this particular case the first check_array
call accept csr/csc so no copy but then it depends on the RobustScaler
, for example the CSR will be turned into CSC in RobustScaler.fit
which may share some indices (so partial copy?) ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two ways of formulating the docstring. 1) We present the cases where copy = False
works exhaustively 2) we present examples where copy=False
fails. I completely agree that the second approach is more feasible.
As you mention, copies probably happen in different functions called by (in this case) maxabs_scale and simply looking at the check_array arguments is misleading. Is however the source of the copy relevant for the docstring?
I see two ways moving forward.
-
I create some sort of test that confirms copies are made in case
x
. For example, I've created a colab where I already check that a copy is made in all the cases that I've listed (even if these don't happen because of check_array, I'm not tracking the source). I can adapt it based on a proposed template. -
I include in all dosctrings only the example
data is not a NumPy array
which triggers a copy throughcheck_array
in all the functions that are part of this PR. (This is basically also what you are proposing if I understand correctly)
I'd personally go for the second option (with maybe a more detailed entry for binarize)? Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like we are on the same line, i.e. giving one or two simple counter-examples, where copy=False still creates a copy so 2).
About ways for moving forward, I would say the testing approach seems too much work and not worth it.
So I would say:
- use the simple example of the numpy array with a non-float dtype everywhere we can, i.e. when the first
check_array
call hasdtype=FLOAT_DTYPES
. This was the confusion in the original issue. More generally I am finding only now (sorry for realising this late) that "is not a Numpy array" is a bit unclear because a scipy sparse matrix is not a numpy array and yet in some cases scipy sparse matrix will not be copied. - treat the remaining cases with more care (I don't have a good view how many there are)
As you mention, copies probably happen in different functions called by (in this case) maxabs_scale and simply looking at the check_array arguments is misleading
Just for completeness, I was trying to say that looking at the first check_array
is enough to find a counter-example where copy=False still creates a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I would use the wording and use "a copy will be returned" instead of "a copy may be returned" for the counter-example we give.
In some way it would be nice to find a way to convey that this is a example where copy=False still returns a copy but that there may be (plenty of) others.
sklearn/preprocessing/_data.py
Outdated
If False, try to avoid a copy and normalize in place. | ||
This is not guaranteed to always work in place; e.g. if the data is | ||
not a NumPy array, or is a scipy.sparse CSR matrix, or is not of dtype | ||
float, a copy may still be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be tweaked
LGTM, I set auto-merge on this PR |
Reference Issues/PRs
Example: Fixes #27307 (the issue was taken but had stalled)
What does this implement/fix? Explain your changes.
minmax_scale has an unexpected behavior when it is called with
X of dtype other than float
andcopy=False
. Specifically, even thoughcopy=False
, X is always first passed through the~sklearn.utils.validation.check_array
function. If the dtype of X is not float check_array casts it to float triggering a copy. Any subsequent changes on X happen on the copy and not in-place.I propose to first change the docstring of minmax_scale to reflect this unexpected behaviour. I also propose a check based on two calls of check_array that outputs a warning when this behaviour is detected.
Any other comments?
Other functions in the _data.py file probably have the same behavior and need to be modified accordingly. I'd be happy to change these as well.