Formula interface #670

MartinStancsicsQC · 2023-08-02T16:26:15Z

Checklist

Added a CHANGELOG.rst entry

Summary

This PR adds support for specifying models via Wilkinson formulas (as in R or statsmodels). The implementation is based on formulaic and is relatively full-featured (it supports most of what statsmodels does). As a not-so-side effect, it allows for the on-the-fly specification of potentially complex interaction terms (cf. #515 and #583).
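For readers less familiar with the notation, here is a toy sketch of how a Wilkinson-style `a * b` term expands into main effects plus the interaction `a:b`. The helpers `expand_star` and `expand_rhs` are hypothetical and written only for illustration; this is not formulaic's parser, and it only handles simple `+`-separated terms without parentheses.

```python
from itertools import combinations


def expand_star(term: str) -> list[str]:
    """Expand 'a * b * c' into main effects and all interactions (toy version)."""
    factors = [f.strip() for f in term.split("*")]
    expanded = []
    # Order 1 gives main effects, order 2 pairwise interactions, and so on.
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            expanded.append(":".join(combo))
    return expanded


def expand_rhs(rhs: str) -> list[str]:
    """Expand every '+'-separated term of a right-hand side, dropping duplicates."""
    terms: list[str] = []
    for raw in rhs.split("+"):
        for term in expand_star(raw):
            if term not in terms:
                terms.append(term)
    return terms


print(expand_rhs("a * b + c"))  # ['a', 'b', 'a:b', 'c']
```

A real formula parser also handles nesting, function calls, and operators such as `-` and `**`, which this sketch deliberately ignores.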

Related `tabmat` changes

Most of the machinery needed to make this possible is implemented in tabmat. Please also take a look at the related tabmat PR #267.

New dependencies

This PR introduces a set of new dependencies. glum will directly depend on formulaic, which in turn requires astor, cached_property, graphlib-backport, interface_meta, and wrapt. Apart from wrapt, all of these are pure-Python packages, and even wrapt has a Python fallback.
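A quick way to see which of these packages are already present in an environment is to query the installed distribution metadata. This is a minimal sketch using only the standard library; the distribution names are copied from the paragraph above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the PR description above.
NEW_DEPENDENCIES = [
    "formulaic",
    "astor",
    "cached_property",
    "graphlib-backport",
    "interface_meta",
    "wrapt",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(NEW_DEPENDENCIES).items():
    print(f"{name}: {ver or 'not installed'}")
```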

Example

The PR also includes a tutorial on how the formula interface works in glum. You can find the notebook here. To learn more about the formula interface on the tabmat side, please check out this example notebook.

MartinStancsicsQC · 2023-08-02T16:32:33Z

If anyone would like to try it out, just install the formula branch of both the tabmat and glum repos in the same environment.

matthewwardrop

Excited to see Formulaic gaining organic adoption like this :).

src/glum/_glm.py

MatthiasSchmidtblaicherQC · 2023-08-22T08:26:16Z

docs/tutorials/formula_interface/formula_interface.ipynb

Great tutorial! It would benefit from a short, newspaper-style summary of the section that shows, with code, the main syntax and advantages (even if the code cannot be run). This should be enough for most users to get started. Something like a simplified combination of the formulas from below:

GeneralizedLinearRegressor(
    family=TweedieDist,
    alpha_search=True,
    l1_ratio=1,
    formula="{ClaimAmountCut / Exposure} ~ C(VehBrand) + C(DrivAge) * C(VehPower) + bs(BonusMalus, 3) + 1",
)

then mentioning:

Efficient categorical encoding,

Interactions on the fly, including for categoricals,

Feature preprocessing such as splines,

Creation of outcomes on the fly, and

Intercept.
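To make the mapping between the formula and the five points above easier to see, the sketch below splits the suggested formula into its outcome and its top-level terms. It is plain Python with a hypothetical helper, `split_top_level`; it does not use formulaic and is only meant to show which piece of the string corresponds to which feature.

```python
def split_top_level(text: str, sep: str) -> list[str]:
    """Split on `sep` only where it is not nested inside (), [] or {}."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


formula = (
    "{ClaimAmountCut / Exposure} ~ "
    "C(VehBrand) + C(DrivAge) * C(VehPower) + bs(BonusMalus, 3) + 1"
)
outcome, predictors = split_top_level(formula, "~")
terms = split_top_level(predictors, "+")

print(outcome)  # {ClaimAmountCut / Exposure}
print(terms)    # ['C(VehBrand)', 'C(DrivAge) * C(VehPower)', 'bs(BonusMalus, 3)', '1']
```

Read against the list: `C(VehBrand)` is the categorical encoding, `C(DrivAge) * C(VehPower)` the interaction, `bs(BonusMalus, 3)` the spline preprocessing, the braces on the left-hand side the outcome created on the fly, and `1` the intercept.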

I would also reorder the sections below according to importance of the feature:

Categoricals.

Interactions on the fly.

Functions.

Miscellaneous.

Some specific comments follow:

Header:

Misspelled linearmodels.

"Reproducing the model from Tutorial 1" should be in title case. "Rank" should also be capitalized in "Structural Full-Rankness". Please check that all headings are in title case.

" instead as categoricals." -> " instead of as categoricals."

"can actually be incorporated" -> "can be incorporated"

I would phrase this differently, along these lines: a huge part of tabmat's/glum's performance advantage is that categoricals need not be one-hot encoded. We can also leverage this advantage within formulas by using the C function. If you want to use a categorical encoding other than one-hot, you can always apply it before the glum estimator.
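As a conceptual illustration of why avoiding one-hot encoding helps, the plain-Python sketch below compares a dense one-hot matrix with an integer-coded column. It is not tabmat's actual implementation; the level names and coefficients are made up for the example.

```python
levels = ["audi", "bmw", "fiat"]
column = ["bmw", "audi", "fiat", "bmw"]
coef = [0.5, -1.0, 2.0]  # one hypothetical coefficient per level

# Dense one-hot encoding: n_rows x n_levels entries, mostly zeros.
one_hot = [[1.0 if value == level else 0.0 for level in levels] for value in column]
dense_result = [sum(x * b for x, b in zip(row, coef)) for row in one_hot]

# Code-based encoding: one integer per row, so X @ coef becomes a lookup.
codes = [levels.index(value) for value in column]
code_result = [coef[code] for code in codes]

assert dense_result == code_result
print(len(one_hot) * len(levels), "stored values vs", len(codes))  # 12 stored values vs 4
```

The gap between the two storage costs grows with the number of levels, which is where the performance advantage mentioned above comes from.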

"regardless of it's type"-> "regardless of its type".

typo "categorixals".

"because e.g. a categorical" -> "because, e.g., a categorical"

"as before" -> "as in the interface without formulas". Not all users will come from a "before".

MartinStancsicsQC · 2023-08-22

Thanks Matthias, great ideas! Would it be okay with you if I put these suggestions on a "to be done before v3 release" backlog, but merge the PR before it's fully done so that we can test formulas in the pre-release?

MartinStancsicsQC · 2023-08-25

@MatthiasSchmidtblaicherQC, let me know what you think about the improved version of the tutorial. I've also added a subsection for a cool new feature: deciding what to do with missing categoricals on a column-by-column basis.

src/glum/_glm.py

tests/glm/test_glm.py

MatthiasSchmidtblaicherQC

Some small suggestions. Otherwise, it looks good!

MatthiasSchmidtblaicherQC · 2023-08-25T16:36:58Z

docs/tutorials/formula_interface/formula_interface.ipynb

Why not move the sneak peek before the table of contents?

typo "forluma"

"statsmodels/linearmodels" -> "statsmodels or linearmodels"

In "The predictors should include the interactions of the categorical variables DrivAge and VehPower, as well as those two variables themselves.", add something like: "Neither the individual categoricals nor their interaction will be dummy-encoded by glum. For categoricals with many levels, this can lead to a substantial performance improvement over dummy encoding, especially for the interaction."
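One way to picture an interaction of two categoricals without dummy encoding is to combine the two integer codes into a single code per row. The sketch below is a conceptual illustration only, not how glum or tabmat necessarily implement it; `interaction_codes` and the example codes are hypothetical.

```python
def interaction_codes(codes_a, codes_b, n_levels_b):
    """Combine two integer-coded categoricals into one code per row."""
    return [a * n_levels_b + b for a, b in zip(codes_a, codes_b)]


driv_age = [0, 2, 1, 2]   # hypothetical codes for a 3-level DrivAge
veh_power = [1, 0, 1, 1]  # hypothetical codes for a 2-level VehPower

combined = interaction_codes(driv_age, veh_power, n_levels_b=2)
print(combined)  # [1, 4, 3, 5]
```

With 3 and 2 levels, dummy encoding the interaction would need up to 3 * 2 = 6 columns, whereas the combined representation keeps one integer per row.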

MarcAntoineSchmidtQC

Great work! Well-tested PR and a super valuable addition to our features.

Please take a look at my small cosmetic comments.

CHANGELOG.rst

docs/tutorials/formula_interface/formula_interface.ipynb

* Make tests green with densematrix-refactor branch
* Remove most Matrixbase subclass checks
* Simplify _group_sum
* Pre-commit autoupdate (#672)
* Use boa in CI. (#673)
* Fix covariance matrix mutating feature names (#671)
* Do not use _set_up_... in covariance_matrix
* Add changelog entry
* Add the option to store the covariance matrix to avoid recomputing it (#661)
* Add option to store covariance matrix during fit
* Fix fitting with variance matrix estimation `.covariance_matrix()` expects X and weights in a different format than what we have at the end of `.fit().
* Store covariance matrix after estimation
* Handle the alpha_search and glm_cv cases
* Propagate covariance parameters
* Add changelog
* Slightly more lenient tests
* Pre-commit autoupdate (#676)
Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
* Fix covariance_matrix dtypes
* Make CI use pre-release tabmat
* Column names à la Tabmat #278 (#678)
* Delegate column naming to tabmat
* Add tests
* More tests
* Test for dropping complete categories
* Add docstrings for new argument
* Add changelog entry
* Convert to pandas at the correct place
* Reorganize converting from pandas
* Remove xfail from test
* Formula interface (#670)
* Add formulaic to dependencies
* Add function for transforming the formula
* Add tests
* First draft of glum formula interface
* Fixes and tests
* Handle intercept correctly
* Add formula functionality to glm_cv
* Variables from local context
* Test predict with formulas
* Add formula tutorial
* Fix tutorial
* Reformat tutorial
* Improve function signatures adn docstrings
* Handle two-sided formulas in covariance_matrix
* Make mypy happy about module names
* Matthias' suggestions
* Improve tutorial
* Improve tutorial
* Formula- and term-based Wald-tests (#689)
* Add formulaic to dependencies
* Add function for transforming the formula
* Add tests
* First draft of glum formula interface
* Fixes and tests
* Handle intercept correctly
* Add formula functionality to glm_cv
* Variables from local context
* Test predict with formulas
* Add formula tutorial
* Fix tutorial
* Reformat tutorial
* Improve function signatures adn docstrings
* Handle two-sided formulas in covariance_matrix
* Make mypy happy about module names
* Matthias' suggestions
* Add back term-based Wald-tests
* Tests for term names
* Add formula-based Wald-test
* Tests for formula-based Wald-test
* Add changelog
* Fix exception message
* Additional test case
* make docstrings clearer in the case of terms
* Support for missing values in categorical columns (#684)
* Delegate column naming to tabmat
* Add tests
* More tests
* Test for dropping complete categories
* Add docstrings for new argument
* Add changelog entry
* Convert to pandas at the correct place
* Reorganize converting from pandas
* Remove xfail from test
* Implement missing categorical support
* Add test
* Solve adding missing category when predicting
* Apply Matthias' suggestions
* Add changelog entry
* Fix formula context (#691)
* Make tests fail
* Propagate context through methods
* pyupgrade
* ensure_full_rank != drop_first
* fix
* move feature name assignment to right spot
* fix
* remove blank line
* bump minimum formulaic version (stateful transforms)
* improve backward compatibility
* Remove code that is not needed in tabmat v4 / glum v3 (#741)
* Remove check_array from predict() We don't need it here as predict calls linear_redictor, and the latter does this check. We can avoid doing it twice.
* Remove _name_categorical_variable parts There is no need for those as Tabmat v4 handles variable names internally.
---------
Co-authored-by: Martin Stancsics <[email protected]>
* Fix formula test: consider presence of intercept in full rankness check when constructing the model matrix externally (#746)
* deal with intercept in formula test correctly
* naming [skip ci]
* test varying significance level in coef table test (#749)
* pin formulaic to 0.6 (#752)
* Add illustration of formula interface to example in README (#751)
* add illustration of formula to readme
* rephrase
* spacing
* add linear term for illustration
* Determine presence of intercept only by `fit_intercept` argument (#747)
* always use self.fit_intercept; raise if formula conflicts with it
* wording [skip ci]
* adjust other tests, cosmetics
* don't compare specs with singular matrix to smf
* fix smf test formula
* fix intercept in context test
* remove outdated sentence; clean up
* fix
* adjust tutorial
* adjust tutorial
* consistent linebreaks in docstring
* remove obsolete arg in docstring
* Informative error when encountering categories that were not seen in training (#748)
* drop missings not seen in training
* zero not drop
* better (?) name [skip ci]
* catch case of unseen missings and fail method
* fix
* respect categorical missing method with formula; test different categorical missing methods also with formula
* shorten the tests
* dont allow fitting in case of conversion of categoricals and presence of formula
* clearer error msg
* also change the error msg in the regex (facepalm)
* remove matches
* fix
* better name
* describe more restrictive behavior in tutorial
* Raise error on unseen levels when predicting
* Allow cat_missing_method='convert' again
* Update test
* Check for unseen categories
* Adapt align_df_categories tests to changes
* Make pre-commit happy
* Avoid unnecessary work
* Correctly expand penalties with categoricals and `cat_missing_method="convert"` (#753)
* Correctyl expand penalties when cat_missing_method=convert
* Add test
* Improve variable names
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
---------
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
* bump tabmat pre-release version
---------
Co-authored-by: Martin Stancsics <[email protected]>
* docstring cosmetics
* even more docstring cosmetics
* Do not fail when an estimator misses class members that are new in v3 (#757)
* do not fail on missing class members that are new in v3
* simplify
* convert
* shorten the comment
* simplify
* don't use getattr unnecessarily
* cosmetics
* fix unrelated typo
* tiny cosmetics [skip ci]
* No regularization as default (#758)
* set alpha=0 as default
* fix docstring
* add alpha where needed to avoid LinAlgError
* add changelog entry
* also set alpha in golden master
* change name in persisted file too
* set alpha in model_parameters again
* don't modify case of no alpha attribute, which is RegressorCV
* remove invalid alpha argument
* wording
* Improve code readability
* Make arguments to public methods except `X`, `y`, `sample_weight` and `offset` keyword-only and make initialization keyword-only (#764)
* make all args except X, y, sample_weight, offset keyword only; make initialization keyword only
* add changelog [skip ci]
* mention that also RegressorBase was changed [skip ci]
* fix import
* clean up changelog
* Restructure distributions (#768)
* Explain `scale_predictors` more (#778)
* Expand on effect of scale_predictors and remove note
* Update src/glum/_glm.py
Co-authored-by: Jan Tilly <[email protected]>
* remove sentence
---------
Co-authored-by: Jan Tilly <[email protected]>
* Move helpers into `_utils` (#782)
* Patch docstring
* Update CHANGELOG.rst
Co-authored-by: Luca Bittarello <[email protected]>
* Apply suggestions from code review
Co-authored-by: Luca Bittarello <[email protected]>
* shorten docstrings of private functions; typos in defaults; other suggestions
* context docstring
* kwargs
* no context as default; small cleanups
* add explanation to get calling scope
* adjust to tabmat release
* keep whitespace
* temporarily add tabmat_dev channel again to investigate env solving failure on CI
* remove tabmat_dev channel again
* for now, disable conda build test on osx and Python 3.12
* Add a different environment for macos (#786)
* try solving on ci with different env for macos
* add missing if
* typo
* try and remove --no-test flag
* replace deprecated scipy.sparse.*_matrix.A
* replace other instance of .A
* two more
* simply replace all instances of .A by .toarray() (tabmat knows both)
* update CHANGELOG for release
---------
Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: Jan Tilly <[email protected]>
Co-authored-by: Marc-Antoine Schmidt <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Martin Stancsics <[email protected]>
Co-authored-by: Luca Bittarello <[email protected]>
Co-authored-by: lbittarello <[email protected]>

MartinStancsicsQC mentioned this pull request Aug 2, 2023

Support initializing matrices with Patsy? Quantco/tabmat#145

Closed

MartinStancsicsQC mentioned this pull request Aug 2, 2023

Interactions #583

Closed

MartinStancsicsQC requested a review from MatthiasSchmidtblaicherQC August 3, 2023 14:23

MarcAntoineSchmidtQC added the on hold not now, maybe never label Aug 7, 2023

matthewwardrop reviewed Aug 12, 2023


MartinStancsicsQC added this to the Glum 3.0 milestone Aug 14, 2023

stanmart added 12 commits August 14, 2023 15:27

Add formulaic to dependencies

12a6942

Add function for transforming the formula

cc50251

Add tests

466c2a5

First draft of glum formula interface

143da05

Fixes and tests

da3f934

Handle intercept correctly

b1143ba

Add formula functionality to glm_cv

dcbc5d8

Variables from local context

ce586f4

Test predict with formulas

30f78df

Add formula tutorial

003443a

Fix tutorial

a685fff

Reformat tutorial

ea2f082

MartinStancsicsQC force-pushed the formula branch from dda6fde to ea2f082 Compare August 14, 2023 13:31

MartinStancsicsQC changed the base branch from main to glum-v3 August 14, 2023 13:32

stanmart added 4 commits August 14, 2023 15:35

Improve function signatures adn docstrings

d6c10fa

Merge branch 'glum-v3' into formula

5182bbf

Merge branch 'glum-v3' into formula

29e4710

Handle two-sided formulas in covariance_matrix

20824c2

MartinStancsicsQC marked this pull request as ready for review August 22, 2023 05:46

MartinStancsicsQC requested review from tbenthompson, MarcAntoineSchmidtQC, xhochy and jtilly as code owners August 22, 2023 05:46

MartinStancsicsQC requested a review from lbittarello as a code owner August 22, 2023 05:46

Make mypy happy about module names

b462816

MatthiasSchmidtblaicherQC requested changes Aug 22, 2023


Matthias' suggestions

22812cc

MartinStancsicsQC requested a review from MatthiasSchmidtblaicherQC August 22, 2023 18:25

MatthiasSchmidtblaicherQC approved these changes Aug 23, 2023


MartinStancsicsQC removed the on hold not now, maybe never label Aug 24, 2023

stanmart added 2 commits August 24, 2023 16:47

Merge branch 'glum-v3' into formula

d6fc8b5

Improve tutorial

78262bc

MatthiasSchmidtblaicherQC reviewed Aug 25, 2023


MarcAntoineSchmidtQC approved these changes Aug 25, 2023


Improve tutorial

fa0316a

MartinStancsicsQC merged commit 9a20282 into glum-v3 Aug 28, 2023
11 checks passed

MartinStancsicsQC deleted the formula branch August 28, 2023 06:07
