Fix covariance matrix mutating feature names #671

MartinStancsicsQC · 2023-08-03T13:45:44Z

Checklist

Added a CHANGELOG.rst entry

This PR implements a minimal fix for #669. It replaces the call to the (non-pure) _set_up_and_check_fit_args method in covariance_matrix() with the side-effect-free check_X_y_tabmat_compliant function plus a few more lines of code.
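The pattern behind #669 can be sketched with a toy example. This is not glum's actual code: `ToyEstimator`, `check_inputs`, `gram` and the `_col_i` default names are hypothetical stand-ins, and the "covariance" is only a placeholder Gram matrix. The point is that reusing a helper that stores feature names overwrites fitted state, while calling a pure check function does not.

```python
def check_inputs(X, y):
    """Pure stand-in for a validation function: returns new lists, touches no state."""
    if isinstance(X, dict):  # named columns
        rows = [list(row) for row in zip(*X.values())]
    else:  # unnamed rows
        rows = [list(row) for row in X]
    if len(rows) != len(y):
        raise ValueError("X and y have inconsistent numbers of rows")
    return rows, list(y)


def gram(rows):
    """Placeholder for the real covariance computation: returns X^T X."""
    n_cols = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)] for i in range(n_cols)]


class ToyEstimator:
    """Hypothetical estimator that only mimics the shape of the problem."""

    def _set_up_and_check(self, X, y):
        # Impure: stores feature names on the estimator as a side effect.
        if isinstance(X, dict):
            self.feature_names_ = list(X)
        else:
            self.feature_names_ = [f"_col_{i}" for i in range(len(X[0]))]
        return check_inputs(X, y)

    def fit(self, X, y):
        self._set_up_and_check(X, y)
        return self

    def covariance_matrix_buggy(self, X, y):
        # Reusing the impure helper overwrites self.feature_names_.
        rows, _ = self._set_up_and_check(X, y)
        return gram(rows)

    def covariance_matrix_fixed(self, X, y):
        # The pure function leaves the fitted state untouched.
        rows, _ = check_inputs(X, y)
        return gram(rows)


X_named = {"age": [1.0, 2.0, 3.0], "income": [2.0, 0.0, 1.0]}
X_plain = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [0.5, 1.5, 2.5]

model = ToyEstimator().fit(X_named, y)
model.covariance_matrix_fixed(X_plain, y)
names_after_fixed = list(model.feature_names_)  # still the fitted names
model.covariance_matrix_buggy(X_plain, y)
names_after_buggy = list(model.feature_names_)  # replaced by default names
```

In this sketch the fitted names survive the fixed method and are replaced by the buggy one, which mirrors the behaviour reported in the linked issue.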

The fix does work, but I believe that the underlying issue is a bit deeper, and a proper solution would require some refactoring. Basically, it is not totally clear (at least to me) where the line lies between _set_up_and_check_fit_args and check_X_y_tabmat_compliant. For example, converting to MatrixBase and handling integer arrays is done in the former, while I feel it should belong to the latter.

It would be nice to have a method/function (could be check_X_y_tabmat_compliant) that has no side effects, and after which we can be sure that X is a nice, checked MatrixBase object. It could also take care of aligning categorical features (I believe the category alignment step was previously omitted from covariance_matrix), or, later on, materializing formulas.
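A minimal sketch of what such a side-effect-free check could look like, under loud assumptions: `CheckedData` and `check_and_align` are hypothetical names, plain tuples stand in for a MatrixBase object, and raising on unseen categories is a choice made for this illustration only. The function validates the input, aligns categorical columns to the levels seen in training, and returns a new object without mutating its arguments or any estimator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckedData:
    """Hypothetical result type: validated columns plus their names."""

    feature_names: tuple
    columns: tuple  # one tuple of values per feature


def check_and_align(X, train_levels):
    """Validate X and align categorical columns to the training levels.

    Pure: builds a new CheckedData and never mutates X or an estimator.
    """
    lengths = {len(col) for col in X.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    columns = []
    for name, values in X.items():
        if name in train_levels:
            levels = train_levels[name]
            unseen = sorted(set(values) - set(levels))
            if unseen:
                raise ValueError(f"unseen categories in {name!r}: {unseen}")
            # Encode each value as its position among the training levels.
            columns.append(tuple(levels.index(v) for v in values))
        else:
            columns.append(tuple(float(v) for v in values))
    return CheckedData(tuple(X), tuple(columns))


train_levels = {"region": ["north", "south", "west"]}
X_new = {"age": [31, 45], "region": ["west", "north"]}
checked = check_and_align(X_new, train_levels)
```

Because the result is a separate frozen object, both fit and covariance_matrix could call such a function and decide for themselves whether to store anything on the estimator.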

lbittarello · 2023-08-03T14:11:59Z

Change log? :)

MartinStancsicsQC · 2023-08-03T14:16:59Z

Oops, added :) Do you think we should merge this quick fix, or go for refactoring check_X_y_tabmat_compliant and _set_up_and_check_fit_args? (Or do both in this order?)

lbittarello · 2023-08-03T14:18:05Z

I'd say: merge the fix first and refactor later. But let's wait for @MarcAntoineSchmidtQC to weigh in.

MarcAntoineSchmidtQC

I don't think we should spend a lot of time refactoring. I'm good with the proposed solution.

Do not use _set_up_... in covariance_matrix

bd9add4

MartinStancsicsQC requested review from tbenthompson, MarcAntoineSchmidtQC, xhochy, jtilly and lbittarello as code owners August 3, 2023 13:45

MartinStancsicsQC linked an issue Aug 3, 2023 that may be closed by this pull request

covariance_matrix() can overwrite feature names #669

Closed

lbittarello approved these changes Aug 3, 2023

Add changelog entry

93c0b25

MarcAntoineSchmidtQC approved these changes Aug 7, 2023

Merge branch 'main' into fix-669

e97841b

MartinStancsicsQC merged commit f60a088 into main Aug 8, 2023
19 checks passed

MartinStancsicsQC deleted the fix-669 branch August 8, 2023 10:47

MartinStancsicsQC pushed a commit that referenced this pull request Aug 14, 2023

Fix covariance matrix mutating feature names (#671)

2b8ae3b

* Do not use _set_up_... in covariance_matrix
* Add changelog entry
