LogisticRegression with SAGA using sample_weight does not converge #21305

@vttrifonov

Describe the bug

I am fitting a logistic regression model on a sparse matrix and a binary response. Many of the rows of the matrix are repeated, so to speed things up I switched to a smaller sparse matrix with non-repeated rows and use the repetitions of the rows to calculate the sample_weight argument to fit.
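For readers who want to see the setup, here is a minimal sketch of collapsing repeated rows into unique rows plus counts. It assumes a small dense array and assumes that duplicates are identified on the (row, label) pair. The helper name `collapse_duplicates` and the toy data are hypothetical and not taken from the notebook, and a scipy.sparse matrix would need a different deduplication step.

```python
import numpy as np

def collapse_duplicates(X, y):
    """Collapse repeated (row, label) pairs into unique pairs plus counts."""
    # Append the label as an extra column so that identical feature rows
    # with different labels stay separate.
    Xy = np.column_stack([X, y])
    unique_rows, counts = np.unique(Xy, axis=0, return_counts=True)
    X_unique = unique_rows[:, :-1]
    y_unique = unique_rows[:, -1].astype(y.dtype)
    return X_unique, y_unique, counts

# Toy data (hypothetical): five rows, three distinct (row, label) pairs.
X = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [0, 1]])
y = np.array([1, 1, 0, 0, 0])

X_unique, y_unique, counts = collapse_duplicates(X, y)
# counts can then be passed as sample_weight when fitting on X_unique, y_unique.
```

The counts sum to the original number of rows, which is the quantity that matters for the scaling question raised further down.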

The issue is that when I work with the weighted model, fit produces a warning that it did not converge because it reached the maximum number of iterations.

I looked a bit under the hood of fit, and sag_solver scales alpha and beta using n_samples. The resulting alpha_scaled and beta_scaled differ between the weighted and unweighted cases, and they should not, since the loss function is the same. Perhaps the equivalent scaling for the weighted case should use the sum of the weights (if the intended 'unit' of the weights is 'count') rather than just n_samples. I am not sure this is the cause, but it made me worry that the sample_weight argument is used somewhat naively, only as a multiplier on the loss function, while there may be scaling implications that are not accounted for when deciding when to stop.
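The scaling concern can be illustrated with a small NumPy sketch. It does not reproduce the internals of sag_solver. It only shows, on hypothetical toy data, that the summed logistic loss is identical for the expanded and the weighted data, while a per-sample average matches only when the weighted case divides by the sum of the weights instead of the number of rows. The function name `log_loss_sum` and the coefficient vector `w` are assumptions made for illustration.

```python
import numpy as np

def log_loss_sum(w, X, y, sample_weight=None):
    """Sum of (optionally weighted) logistic losses for labels in {0, 1}."""
    z = X @ w
    # log(1 + exp(z)) - y * z, computed stably with logaddexp.
    per_sample = np.logaddexp(0.0, z) - y * z
    if sample_weight is None:
        return per_sample.sum()
    return (sample_weight * per_sample).sum()

# Unique rows with counts (hypothetical toy data).
X_unique = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
y_unique = np.array([0, 0, 1])
counts = np.array([2, 1, 2])

# Expanded data set with every row repeated according to its count.
X_full = np.repeat(X_unique, counts, axis=0)
y_full = np.repeat(y_unique, counts)

w = np.array([0.5, -0.25])  # arbitrary coefficients

full_total = log_loss_sum(w, X_full, y_full)
weighted_total = log_loss_sum(w, X_unique, y_unique, sample_weight=counts)

# Three candidate normalizations of the same total loss.
mean_full = full_total / X_full.shape[0]
mean_by_rows = weighted_total / X_unique.shape[0]
mean_by_weight_sum = weighted_total / counts.sum()
```

Here `mean_by_weight_sum` agrees with `mean_full`, while `mean_by_rows` does not, which is the kind of mismatch described above for any quantity scaled by n_samples.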

UPDATE: It seems that the issue is specific to the SAGA solver. I tried liblinear and it seems to work. That solves my immediate problem, because for now I only want L1 regularization. It is still worth looking into, because at the moment only SAGA offers elasticnet.
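A rough sketch of the liblinear check described in the update, on synthetic data that is hypothetical and not from the notebook. It fits an L1-penalized model once on the expanded rows and once on the unique rows with counts passed as sample_weight, then compares coefficients. The values of `C`, `tol`, and `max_iter` are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Synthetic unique rows, labels, and repetition counts (all hypothetical).
X_unique = rng.normal(size=(30, 4))
y_unique = (X_unique[:, 0] > 0).astype(int)
y_unique[::7] = 1 - y_unique[::7]  # flip a few labels to add noise
counts = rng.randint(1, 6, size=30)

# Expanded data set with repeated rows.
X_full = np.repeat(X_unique, counts, axis=0)
y_full = np.repeat(y_unique, counts)

params = dict(penalty="l1", solver="liblinear", C=1.0,
              tol=1e-8, max_iter=1000, random_state=0)

# Fit on the expanded data, then on the unique rows with counts as weights.
full_model = LogisticRegression(**params).fit(X_full, y_full)
weighted_model = LogisticRegression(**params).fit(
    X_unique, y_unique, sample_weight=counts
)

max_coef_diff = np.abs(full_model.coef_ - weighted_model.coef_).max()
```

If the weights are handled as counts, `max_coef_diff` should be small, since both fits minimize the same objective.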

Steps/Code to Reproduce

vttrifonov/logistic_sample_weights.ipynb

Expected Results

In the above code I expect the second fit to run much faster than the first and to produce the same coefficients.

Actual Results

It fails in both.

Versions

System:
python: 3.7.0 (default, Jun 28 2018, 07:39:16) [Clang 4.0.1 (tags/RELEASE_401/final)]
executable: /Users/vtrifonov/projects/tiny-proteins/env/bin/python
machine: Darwin-19.6.0-x86_64-i386-64bit

Python dependencies:
pip: 21.2.2
setuptools: 58.0.4
sklearn: 0.24.2
numpy: 1.20.3
scipy: 1.7.1
Cython: None
pandas: 1.3.2
matplotlib: 3.4.2
joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True
