Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVC and OneClassSVM fails to fit or have wrong fitted attributes with null sample weights #25380

Open
andrewdelong opened this issue Jan 12, 2023 · 12 comments · May be fixed by #27763
Open

SVC and OneClassSVM fails to fit or have wrong fitted attributes with null sample weights #25380

andrewdelong opened this issue Jan 12, 2023 · 12 comments · May be fixed by #27763
Labels
Bug Hard Hard level of difficulty help wanted

Comments

@andrewdelong
Copy link
Contributor

Describe the bug

SVC().fit(X, y, w) fails when the targets y are multiclass and the sample_weights w zero out one of the classes.

  • Dense X produces incorrect arrays for support_, n_support_, and dual_coef_ attributes, some with wrong dimension.
  • Sparse X errors out when trying to construct sparse dual_coef_ from incompatible arguments.

A warning is emitted (e.g., "class label 0 specified in weight is not found"), but it does not indicate that the arrays on the trained SVC object are incorrect.

Seems to be a case that was not tested by PR #14286.

Workaround

Replace the zero weights (or negative weights) with very small values like 1e-16.

Steps/Code to Reproduce

import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 0.], [0., 1.]])   # or sp.csr_matrix(...)
y = [0, 1, 2]
w = [0., 1., 1.]                               # class 0 has zero weight
clf = SVC().fit(X, y, w)

Expected Results

The fitted attributes should be

classes_: [0 1 2]
support_: [1 2]
support_vectors_: [[1. 0.]
                   [0. 1.]]
n_support_: [0 1 1]
dual_coef_: [[0. 0. 0.]
             [0. 1. -1.]]

assuming the 'arbitrary' values in dual_coef_ are set to zero.

Actual Results

For dense X, the fitted attributes are actually

classes_: [0 1 2]
support_: [0 1]              <-- should be [1 2]
support_vectors_: [[1. 0.]
                   [0. 1.]]
n_support_: [1 1]            <-- should be [0 1 1]
dual_coef_: [[ 1. -1.]]      <-- should be [[0. 0. 0.] [0. 1. -1.]]

For sparse X it raises

ValueError: indices and data should have the same size

with traceback

sklearn/svm/_base.py:252, in fit(self, X, y, sample_weight)
--> 252 fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)

sklearn/svm/_base.py:413, in _sparse_fit(self, X, y, sample_weight, solver_type, kernel, random_seed)
--> 413     self.dual_coef_ = sp.csr_matrix(
    414         (dual_coef_data, dual_coef_indices, dual_coef_indptr), (n_class, n_SV)
    415     )

scipy/sparse/_compressed.py:106, in _cs_matrix.__init__(self, arg1, shape, dtype, copy)
--> 106 self.check_format(full_check=False)

scipy/sparse/_compressed.py:176, in _cs_matrix.check_format(self, full_check)
    174 # check index and data arrays
    175 if (len(self.indices) != len(self.data)):
--> 176     raise ValueError("indices and data should have the same size")

Versions

System:
    python: 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:55:37)  [Clang 14.0.6 ]
executable: miniconda3/bin/python
   machine: macOS-13.1-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.24.1
        scipy: 1.10.0
       Cython: 0.29.33
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
         prefix: libmkl_rt
       filepath: miniconda3/lib/libmkl_rt.dylib
        version: 2020.0.4
threading_layer: intel
    num_threads: 8

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: miniconda3/lib/libomp.dylib
        version: None
    num_threads: 16
@andrewdelong andrewdelong added Bug Needs Triage Issue requires triage labels Jan 12, 2023
@andrewdelong
Copy link
Contributor Author

This bug report replaces bug#19654, which misattributed the bug to libsvm_sparse specifically.

@adrinjalali
Copy link
Member

Thanks for the report. Indeed this could be fixed. So I'll tag this accordingly. Would be nice if you could submit a PR to fix it @andrewdelong

@adrinjalali adrinjalali added help wanted Moderate Anything that requires some knowledge of conventions and best practices and removed Needs Triage Issue requires triage labels Apr 21, 2023
@rand0wn
Copy link
Contributor

rand0wn commented Jun 12, 2023

/take

@rand0wn
Copy link
Contributor

rand0wn commented Jun 12, 2023

I am trying to work on this issue, and trying the workaround as suggested by @andrewdelong

import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 0.], [0., 1.]])   # or sp.csr_matrix(...)
y = [0, 1, 2]
w = np.array([0., 1., 1.])                          
w[w <= 0] = 1e-16
print(w)
clf = SVC().fit(X, y, w)

However, the results are not as expected

Results

w: [1.e-16 1.e+00 1.e+00]
classes_:  [0 1 2]
support_:  [0 1 2]
support_vectors_:  [[0. 0.]
 [1. 0.]
 [0. 1.]]
n_support_:  [1 1 1]
dual_coef_:  [[ 1.e-16 -1.e-16 -1.e-16]
 [ 1.e-16  1.e+00 -1.e+00]]

Need your suggestions on what to try next? @adrinjalali @andrewdelong

@adrinjalali
Copy link
Member

I certainly wouldn't try to change sample weights as a workaround. We probably want to run svm with those classes omitted in this case.

@rand0wn
Copy link
Contributor

rand0wn commented Jun 13, 2023

So one way we can do this is to remove values from X, y, and w, where w == 0. Or we can find a way to skip this in training ?

@adrinjalali
Copy link
Member

@rand0wn figuring it out would be a part of solving this issue 😁 I haven't jumped deep into that part of the codebase in a while.

@rand0wn
Copy link
Contributor

rand0wn commented Jun 13, 2023

Got it will try to figure out ways it's possible, and run the testcases

@rand0wn
Copy link
Contributor

rand0wn commented Jun 14, 2023

On Investigation, all the variations except multiclass SVC remove zero_weights

if(svm_type == C_SVC ||
	   svm_type == EPSILON_SVR ||
	   svm_type == NU_SVR ||
	   svm_type == ONE_CLASS)
	{
		PREFIX(problem) newprob;
		// filter samples with negative and null weights
		remove_zero_weight(&newprob, prob);

Either we can do the same with multiclass as well or return an error

For example in LinearSVC

import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0., 0.], [1., 0.], [0., 1.]])   # or sp.csr_matrix(...)
y = [0, 1, 2]
w = [0., 1., 1.]                               # class 0 has zero weight
clf = LinearSVC().fit(X, y, w)
/Users/abhishek/scikit-learn/sklearn/svm/_classes.py:32: FutureWarning: The default value of `dual` will change from `True` to `'auto'` in 1.5. Set the value of `dual` explicitly to suppress the warning.
  warnings.warn(
WARNING: class label 0 specified in weight is not found

classes_:  [0 1 2]
coef_:  [[-0.66666488  0.66666219]] # 2 Coef for 3 classes
intercept_:  [-2.68435456e-06]
intercept_:  [-2.68435456e-06]

While with Linear Kernel in SVC

import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 0.], [0., 1.]])   # or sp.csr_matrix(...)
y = np.array([0, 1, 2])
w = np.array([0., 1., 1.])                          

clf = SVC(kernel='linear').fit(X, y, w)

classes_:  [0 1 2]
support_:  [0 1]
support_vectors_:  [[1. 0.]
 [0. 1.]]
n_support_:  [1 1]
dual_coef_:  [[ 1. -1.]]

Also when removing zero-weight classes

import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 0.], [0., 1.]])   # or sp.csr_matrix(...)
y = np.array([0, 1, 2])
w = np.array([0., 1., 1.])                          
zero_weight_index = (w == 0)
X = X[~zero_weight_index]
y = y[~zero_weight_index]
w = w[~zero_weight_index]

clf = SVC().fit(X, y, w)

classes_:  [1 2]
support_:  [0 1]
support_vectors_:  [[1. 0.]
 [0. 1.]]
n_support_:  [1 1]
dual_coef_:  [[-1.  1.]]

Gives the same thing without warning, my suggestion is either to remove zero weight classes for all to make it consistent or return an error, thoughts?

@glemaitre @adrinjalali

@adrinjalali
Copy link
Member

I think making it consistent with the other cases makes sense. Thanks for investigating it @rand0wn

@glemaitre
Copy link
Member

Looking at the code I don't think that the results of the investigation is right.
The negative and null sample-weights are removed even in the multiclass case: C-SVC is the same type for binary or multiclass.

As mentioned in the original post, the dense and sparse case are wrong. The sparse case fail for inconsistent shape but the dense case shows the issue: the internal fit of libsvm was fit only on 2 class and thus we have an issue in the reported values. I did not check the code yet but we should take care of some offsets that should be apply once the fit is done.

@glemaitre
Copy link
Member

The remove_zero_weight should create a mapping to store what samples have been removed because it will create an offset in the support vector index. In addition, we need to find out that we completely remove a specific class and inject this information in the dual_coef_ and n_support_.

All of these need to be done in the svm.cpp file to be meaningful.

At least we have a good non-regression case. But as far I can say, we should suffer from the same issue for the OneClassSVM.

@glemaitre glemaitre changed the title Multiclass SVC.fit fails if sample_weight zeros out a class SVC and OneClassSVM fails to fit or have wrong fitted attributes with null sample weight Nov 9, 2023
@glemaitre glemaitre changed the title SVC and OneClassSVM fails to fit or have wrong fitted attributes with null sample weight SVC and OneClassSVM fails to fit or have wrong fitted attributes with null sample weights Nov 9, 2023
@glemaitre glemaitre added Hard Hard level of difficulty and removed Moderate Anything that requires some knowledge of conventions and best practices labels Nov 9, 2023
@glemaitre glemaitre linked a pull request Nov 10, 2023 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Hard Hard level of difficulty help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants