Unexpected behavior of sklearn.feature_selection.mutual_info_regression if copy=False #28793

aivarsoo · 2024-04-09T10:42:42Z

Describe the bug

The parameter copy of the function mutual_info_regression is described as follows

scikit-learn/sklearn/feature_selection/_mutual_info.py

Lines 381 to 383 in d1d1596

    
               copy : bool, default=True 
        
                   Whether to make a copy of the given data. If set to False, the initial 
        
                   data will be overwritten.

I read it as both X and y should be modified if copy=False and X has continuous features. However, y is always copied. I think the lines

scikit-learn/sklearn/feature_selection/_mutual_info.py

Lines 309 to 310 in d1d1596

    
           if not discrete_target: 
        
               y = scale(y, with_mean=False)

should be

    if not discrete_target:
        y = y.astype(np.float64, copy=copy)
        y = scale(y, with_mean=False, copy=False)

Similarly to the the treatment of X

scikit-learn/sklearn/feature_selection/_mutual_info.py

Lines 295 to 299 in d1d1596

    
           if np.any(continuous_mask): 
        
               X = X.astype(np.float64, copy=copy) 
        
               X[:, continuous_mask] = scale( 
        
                   X[:, continuous_mask], with_mean=False, copy=False 
        
               )

Steps/Code to Reproduce

import numpy as np
from sklearn.feature_selection import mutual_info_regression
n_samples_, n_feats = 30, 2
X = np.random.randn(n_samples_, n_feats)
y = np.random.randn(n_samples_, )
y_copy = y.copy()
X_copy = X.copy()
mutual_info_regression(X, y, copy=False)
print(np.allclose(y, y_copy), np.allclose(X, X_copy))

Expected Results

The result should be

False, False

since both X and y should be modified in place by the function mutual_info_regression.

Actual Results

True, False

Versions

System:
    python: 3.11.7 (main, Dec  5 2023, 19:13:35) [GCC 10.2.1 20210110]
executable: /usr/local/bin/python
   machine: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.3.0
          pip: 23.3.1
   setuptools: 69.0.2
        numpy: 1.23.2
        scipy: 1.12.0
       Cython: 3.0.9
       pandas: 2.1.4
   matplotlib: 3.8.2
       joblib: 1.3.2
threadpoolctl: 3.2.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /home/user/.local/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-742d56dc.3.20.so
        version: 0.3.20
threading_layer: pthreads
   architecture: Prescott

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: /home/user/.local/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-23e5df77.3.21.dev.so
        version: 0.3.21.dev
threading_layer: pthreads
   architecture: Prescott

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: libgomp
       filepath: /home/user/.local/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None

The text was updated successfully, but these errors were encountered:

lesteve · 2024-04-10T07:49:19Z

It kinds of remind me of #27307 and #27691. In general, copy=False means "it may avoid a copy" but it is not always possible. In other words, copy=False may reduce the memory usage, but the fact that the input array is modified should not be relied upon.

I think there are two complementary ways to improve the situation:

change the docstring to a similar wording as in DOC improve documentation of copy=False in preprocessing functions #27691
if the code changes that you suggest improves the situation, why not? I have to say I am not sure we have some tests to check the copy=False behaviour.

aivarsoo · 2024-04-10T08:23:28Z

Regarding the first way. In this case, I guess the docstring should qualify that only X is affected by copy, but y is always copied in addition to the issues with sparse matrices.

I am a bit biased here as I want y to have the same behavior of X for my application, but I am trying to understand if there's a reason to always copy y?

lesteve · 2024-04-10T08:35:50Z

It would help a lot if you could explain your use case a bit 🙏. In particular how does it affect you that y is copied and not X when using copy=False?

aivarsoo · 2024-04-10T09:35:44Z

This issue may go in a bit different direction now :) Let me know if i should separate it into a new one.

I am updating a third party sklearn plug-in for feature selection, where we want to add a possibility for feature selection using conditional mutual information based on the paper. Note that unlike the paper itself, our solution will use a proprietary solver, but the plug-in itself is open-source.

Mathematically we will aim to solve

$$ \max\limits_{\delta_i} \sum\limits_{i=1}^n I(x_i; y) \delta_i + \sum\limits_{i, j=1}^n I(x_i; y | x_j) \delta_i \delta_j $$

where $x_i$ are features, $y$ is the target variable, and the variables $\delta_i\in{0,1}$ are binary with an associated constraint $\sum \delta_i = K$ for a pre-defined value $K$. If the variable $\delta_i =1$ then we will pick the feature $i$.

The conditional mutual information algorithm will follow mostly this implementation, which is in turn based on the sklearn implementation for mutual information computation.

Essentially, I want to compute the values $I(x_i; y)$, which I can do using the public methods mutual_info_classif, mutual_info_regression from sklearn.feature_selection or a private method _compute_mi from sklearn.feature_selection._mutual_info.

If I use the public method mutual_info_regression, X and y will be rescaled if they correspond to a data of a non-discrete distribution. I will then need to rescale y again for computing $I(x_i; y | x_j)$. If X and y would have the same behavior, then I can avoid rescaling y twice and essentially using different data y for computing $I(x_i; y)$ and $I(x_i; y | x_j)$.
If we can make the method _compute_mi public, then I can use my custom function for computing the values of $I(x_i; y | x_j)$ and $I(x_i; y)$ simultaneously.

The second option would also allow for using the problem formulation described in the mRMR paper. Similar to an existing issue #8889 but using our solver. I was actually considering making the second option (making _compute_mi public) an issue too, but decided to start with a smaller issue.

I am happy to submit a PR on some of these issues.

lesteve · 2024-04-11T09:32:45Z

Thanks for the details!

So my understanding is that you do something like this:

mutual_info_1 = mutual_info_regression(X, y, copy=False)
# make assumptions on how X and y have been modified and call `mutual_info_regression` again
mutual_info_2 = mutual_info_regression(..., copy=False)

How much do you care about using copy=False, in other words, does using copy=True have a big effect in your use case? One suggestion would be to use copy=True in the first mutual_info_regression call and then copy=False in the second one (if that really makes a difference).

My current understanding that may not be fully accurate: when using copy=False there is no guarantee on how your input values may change, basically you may avoid a copy at the cost of not being able to use your input array after the function call.

Even if copy=False makes a big difference in your use case, I would rather rely on a private function that on how X and y are modified when using copy=False. I don't think there is any guarantee on this and this may change without warnings, at least if we rename or remove _compute_mi you will get an error.

Then there is the question on _compute_mi public, I am not very familiar with this part of scikit-learn, so I don't have an informed opinion. One thing to have in mind is that having an additional public function could lead users to confusion and/or not using the correct public function.

aivarsoo · 2024-04-11T15:08:39Z

Thanks for the response!

An example code would be something like

# We rescale X and y in the sklearn method computing I(x_i; y) for all i:
mutual_info = mutual_info_regression(X, y, copy=False)
# We re-use the rescaled X and y in our method computing I(x_i; y| x_j) 
# for all i and j , where we do not modify X or y
conditional_mutual_info = conditional_mutual_info(X, y)

Note that the mutual_info methods rescale and add noise to X and y, which I wouldn't want to do several times. Also I want to use the same X and y for all the computations/estimations.

But if I understood you correctly using copy=False doesn't give me much guarantees on what will happen to X after the call to this method and I shouldn't rely on this functionality? If so then I may need to look for a different solution anyway.

On another note, I think importing the private sklearn method wouldn't work in the long run for our plug-in.

lesteve · 2024-04-11T15:29:39Z

But if I understood you correctly using copy=False doesn't give me much guarantees on what will happen to X after the call to this method and I shouldn't rely on this functionality?

Yes, I asked other maintainers inputs and basically there seems to be agreement on this: if you use sklearn_func(X, copy=False) don't reuse X because there is no guarantee on how its values have changed after the sklearn_func call.

I am going to label this issue as Documentation because I think the most reasonable thing to do would be to document this in the glossary. I guess adding a link to the glossary entry in all the function/classes that uses copy may be a good idea too but quite some work.

aivarsoo · 2024-04-12T08:39:47Z

Many thanks!

aivarsoo added Bug Needs Triage Issue requires triage labels Apr 9, 2024

lesteve added help wanted and removed Needs Triage Issue requires triage labels Apr 10, 2024

lesteve added Documentation and removed Bug labels Apr 11, 2024

lamdang2k linked a pull request Nov 19, 2024 that will close this issue

DOC Document copy parameter and add link to the glossary entry #30307

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected behavior of sklearn.feature_selection.mutual_info_regression if copy=False #28793

Unexpected behavior of sklearn.feature_selection.mutual_info_regression if copy=False #28793

aivarsoo commented Apr 9, 2024

lesteve commented Apr 10, 2024 •

edited

Loading

aivarsoo commented Apr 10, 2024 •

edited

Loading

lesteve commented Apr 10, 2024

aivarsoo commented Apr 10, 2024

lesteve commented Apr 11, 2024 •

edited

Loading

aivarsoo commented Apr 11, 2024

lesteve commented Apr 11, 2024 •

edited

Loading

aivarsoo commented Apr 12, 2024

Unexpected behavior of sklearn.feature_selection.mutual_info_regression if copy=False #28793

Unexpected behavior of sklearn.feature_selection.mutual_info_regression if copy=False #28793

Comments

aivarsoo commented Apr 9, 2024

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

lesteve commented Apr 10, 2024 • edited Loading

aivarsoo commented Apr 10, 2024 • edited Loading

lesteve commented Apr 10, 2024

aivarsoo commented Apr 10, 2024

lesteve commented Apr 11, 2024 • edited Loading

aivarsoo commented Apr 11, 2024

lesteve commented Apr 11, 2024 • edited Loading

aivarsoo commented Apr 12, 2024

lesteve commented Apr 10, 2024 •

edited

Loading

aivarsoo commented Apr 10, 2024 •

edited

Loading

lesteve commented Apr 11, 2024 •

edited

Loading

lesteve commented Apr 11, 2024 •

edited

Loading