glum v3.0 (#677)

* Make tests green with densematrix-refactor branch
* Remove most Matrixbase subclass checks
* Simplify _group_sum
* Pre-commit autoupdate (#672)
* Use boa in CI. (#673)
* Fix covariance matrix mutating feature names (#671)
* Do not use _set_up_... in covariance_matrix
* Add changelog entry
* Add the option to store the covariance matrix to avoid recomputing it (#661)
* Add option to store covariance matrix during fit
* Fix fitting with variance matrix estimation
  `.covariance_matrix()` expects X and weights in a different format than what we have at the end of `.fit()`.
* Store covariance matrix after estimation
* Handle the alpha_search and glm_cv cases
* Propagate covariance parameters
* Add changelog
* Slightly more lenient tests
* Pre-commit autoupdate (#676)
  Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
* Fix covariance_matrix dtypes
* Make CI use pre-release tabmat
* Column names à la Tabmat #278 (#678)
* Delegate column naming to tabmat
* Add tests
* More tests
* Test for dropping complete categories
* Add docstrings for new argument
* Add changelog entry
* Convert to pandas at the correct place
* Reorganize converting from pandas
* Remove xfail from test
* Formula interface (#670)
* Add formulaic to dependencies
* Add function for transforming the formula
* Add tests
* First draft of glum formula interface
* Fixes and tests
* Handle intercept correctly
* Add formula functionality to glm_cv
* Variables from local context
* Test predict with formulas
* Add formula tutorial
* Fix tutorial
* Reformat tutorial
* Improve function signatures and docstrings
* Handle two-sided formulas in covariance_matrix
* Make mypy happy about module names
* Matthias' suggestions
* Improve tutorial
* Improve tutorial
* Formula- and term-based Wald-tests (#689)
* Add formulaic to dependencies
* Add function for transforming the formula
* Add tests
* First draft of glum formula interface
* Fixes and tests
* Handle intercept correctly
* Add formula functionality to glm_cv
* Variables from local context
* Test predict with formulas
* Add formula tutorial
* Fix tutorial
* Reformat tutorial
* Improve function signatures and docstrings
* Handle two-sided formulas in covariance_matrix
* Make mypy happy about module names
* Matthias' suggestions
* Add back term-based Wald-tests
* Tests for term names
* Add formula-based Wald-test
* Tests for formula-based Wald-test
* Add changelog
* Fix exception message
* Additional test case
* make docstrings clearer in the case of terms
* Support for missing values in categorical columns (#684)
* Delegate column naming to tabmat
* Add tests
* More tests
* Test for dropping complete categories
* Add docstrings for new argument
* Add changelog entry
* Convert to pandas at the correct place
* Reorganize converting from pandas
* Remove xfail from test
* Implement missing categorical support
* Add test
* Solve adding missing category when predicting
* Apply Matthias' suggestions
* Add changelog entry
* Fix formula context (#691)
* Make tests fail
* Propagate context through methods
* pyupgrade
* ensure_full_rank != drop_first
* fix
* move feature name assignment to right spot
* fix
* remove blank line
* bump minimum formulaic version (stateful transforms)
* improve backward compatibility
* Remove code that is not needed in tabmat v4 / glum v3 (#741)
* Remove check_array from predict()
  We don't need it here as predict calls linear_predictor, and the latter does this check. We can avoid doing it twice.
* Remove _name_categorical_variable parts
  There is no need for those as Tabmat v4 handles variable names internally.
---------
Co-authored-by: Martin Stancsics
* Fix formula test: consider presence of intercept in full rankness check when constructing the model matrix externally (#746)
* deal with intercept in formula test correctly
* naming [skip ci]
* test varying significance level in coef table test (#749)
* pin formulaic to 0.6 (#752)
* Add illustration of formula interface to example in README (#751)
* add illustration of formula to readme
* rephrase
* spacing
* add linear term for illustration
* Determine presence of intercept only by `fit_intercept` argument (#747)
* always use self.fit_intercept; raise if formula conflicts with it
* wording [skip ci]
* adjust other tests, cosmetics
* don't compare specs with singular matrix to smf
* fix smf test formula
* fix intercept in context test
* remove outdated sentence; clean up
* fix
* adjust tutorial
* adjust tutorial
* consistent linebreaks in docstring
* remove obsolete arg in docstring
* Informative error when encountering categories that were not seen in training (#748)
* drop missings not seen in training
* zero not drop
* better (?) name [skip ci]
* catch case of unseen missings and fail method
* fix
* respect categorical missing method with formula; test different categorical missing methods also with formula
* shorten the tests
* don't allow fitting in case of conversion of categoricals and presence of formula
* clearer error msg
* also change the error msg in the regex (facepalm)
* remove matches
* fix
* better name
* describe more restrictive behavior in tutorial
* Raise error on unseen levels when predicting
* Allow cat_missing_method='convert' again
* Update test
* Check for unseen categories
* Adapt align_df_categories tests to changes
* Make pre-commit happy
* Avoid unnecessary work
* Correctly expand penalties with categoricals and `cat_missing_method="convert"` (#753)
* Correctly expand penalties when cat_missing_method=convert
* Add test
* Improve variable names
  Co-authored-by: Matthias Schmidtblaicher
---------
Co-authored-by: Matthias Schmidtblaicher
* bump tabmat pre-release version
---------
Co-authored-by: Martin Stancsics
* docstring cosmetics
* even more docstring cosmetics
* Do not fail when an estimator misses class members that are new in v3 (#757)
* do not fail on missing class members that are new in v3
* simplify
* convert
* shorten the comment
* simplify
* don't use getattr unnecessarily
* cosmetics
* fix unrelated typo
* tiny cosmetics [skip ci]
* No regularization as default (#758)
* set alpha=0 as default
* fix docstring
* add alpha where needed to avoid LinAlgError
* add changelog entry
* also set alpha in golden master
* change name in persisted file too
* set alpha in model_parameters again
* don't modify case of no alpha attribute, which is RegressorCV
* remove invalid alpha argument
* wording
* Improve code readability
* Make arguments to public methods except `X`, `y`, `sample_weight` and `offset` keyword-only and make initialization keyword-only (#764)
* make all args except X, y, sample_weight, offset keyword only; make initialization keyword only
* add changelog [skip ci]
* mention that also RegressorBase was changed [skip ci]
* fix import
* clean up changelog
* Restructure distributions (#768)
* Explain `scale_predictors` more (#778)
* Expand on effect of scale_predictors and remove note
* Update src/glum/_glm.py
  Co-authored-by: Jan Tilly
* remove sentence
---------
Co-authored-by: Jan Tilly
* Move helpers into `_utils` (#782)
* Patch docstring
* Update CHANGELOG.rst
  Co-authored-by: Luca Bittarello
* Apply suggestions from code review
  Co-authored-by: Luca Bittarello
* shorten docstrings of private functions; typos in defaults; other suggestions
* context docstring
* kwargs
* no context as default; small cleanups
* add explanation to get calling scope
* adjust to tabmat release
* keep whitespace
* temporarily add tabmat_dev channel again to investigate env solving failure on CI
* remove tabmat_dev channel again
* for now, disable conda build test on osx and Python 3.12
* Add a different environment for macos (#786)
* try solving on ci with different env for macos
* add missing if
* typo
* try and remove --no-test flag
* replace deprecated scipy.sparse.*_matrix.A
* replace other instance of .A
* two more
* simply replace all instances of .A by .toarray() (tabmat knows both)
* update CHANGELOG for release
---------
Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: Jan Tilly
Co-authored-by: Marc-Antoine Schmidt
Co-authored-by: Matthias Schmidtblaicher
Co-authored-by: Martin Stancsics
Co-authored-by: Luca Bittarello
Co-authored-by: lbittarello
Quantco · Apr 27, 2024 · 653b419
1 parent 954abc6
Showing 26 changed files with 4,629 additions and 1,390 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -40,14 +40,24 @@ jobs:
     steps:
       - name: Checkout branch
         uses: actions/checkout@v4
-      - name: Set up conda env
+      - name: Set up conda env (windows and ubuntu)
+        if: matrix.os != 'macos-latest'
         uses: mamba-org/setup-micromamba@422500192359a097648154e8db4e39bdb6c6eed7
         with:
           environment-file: environment.yml
           init-shell: ${{ matrix.os == 'windows-latest' && 'powershell' || 'bash' }}
           cache-environment: true
           create-args: >-
             python=${{ matrix.python-version }}
+      - name: Set up conda env (macos)
+        if: matrix.os == 'macos-latest'
+        uses: mamba-org/setup-micromamba@422500192359a097648154e8db4e39bdb6c6eed7
+        with:
+          environment-file: environment-macos.yml
+          init-shell: bash
+          cache-environment: true
+          create-args: >-
+            python=${{ matrix.python-version }}
       - name: Install repository (unix)
         if: matrix.os != 'windows-latest'
         shell: bash -el {0}

diff --git a/.github/workflows/conda-build.yml b/.github/workflows/conda-build.yml
@@ -24,7 +24,7 @@ jobs:
           - { conda_build_yml: linux_64_python3.12.____cpython,  os: ubuntu-latest,  conda-build-args: '' }
           - { conda_build_yml: osx_64_python3.9.____cpython,     os: macos-latest,   conda-build-args: '' }
           - { conda_build_yml: osx_64_python3.12.____cpython,    os: macos-latest,   conda-build-args: '' }
-          - { conda_build_yml: osx_arm64_python3.10.____cpython, os: macos-latest,   conda-build-args: ' --no-test' }
+          - { conda_build_yml: osx_arm64_python3.10.____cpython, os: macos-latest,   conda-build-args: '' }
           - { conda_build_yml: win_64_python3.9.____cpython,     os: windows-latest, conda-build-args: '' }
           - { conda_build_yml: win_64_python3.12.____cpython,    os: windows-latest, conda-build-args: '' }
     steps:

diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -7,6 +7,28 @@
 Changelog
 =========
 
+3.0.0 - 2024-04-27
+------------------
+
+**Breaking changes:**
+
+- All arguments to :class:`~glum.GeneralizedLinearRegressorBase`, :class:`~glum.GeneralizedLinearRegressor` and :class:`~glum.GeneralizedLinearRegressorCV` are now keyword-only.
+- All arguments to public methods of :class:`~glum.GeneralizedLinearRegressorBase`, :class:`~glum.GeneralizedLinearRegressor` or :class:`~glum.GeneralizedLinearRegressorCV` except ``X``, ``y``, ``sample_weight`` and ``offset`` are now keyword-only.
+- :class:`~glum.GeneralizedLinearRegressor`'s default value for ``alpha`` is now ``0``, i.e. no regularization.
+- :class:`~glum.GammaDistribution`, :class:`~glum.InverseGaussianDistribution`, :class:`~glum.NormalDistribution` and :class:`~glum.PoissonDistribution` no longer inherit from :class:`~glum.TweedieDistribution`.
+- The power parameter of :class:`~glum.TweedieLink` has been renamed from ``p`` to ``power``, in line with :class:`~glum.TweedieDistribution`.
+- :class:`~glum.TweedieLink` no longer instantiates :class:`~glum.IdentityLink` or :class:`~glum.LogLink` for ``power=0`` and ``power=1``, respectively. On the other hand, :class:`~glum.TweedieLink` is now compatible with ``power=0`` and ``power=1``.
+
+**New features:**
+
+- Added a formula interface for specifying models.
+- Improved feature name handling. Feature names are now created for non-pandas input matrices too. Furthermore, the format of categorical features can be specified by the user.
+- Term names are now stored in the model's attributes. This is useful for categorical features, where they refer to the whole variable, not just single levels.
+- Added more options for treating missing values in categorical columns. They can either raise a ``ValueError`` (``"fail"``), be treated as all-zero indicators (``"zero"``) or be represented as a new category (``"convert"``).
+- :meth:`~glum.GeneralizedLinearRegressor.wald_test` can now perform tests based on a formula string and term names.
+- :class:`~glum.InverseGaussianDistribution` gains a :meth:`~glum.InverseGaussianDistribution.log_likelihood` method.
+
+
 2.7.0 - 2024-02-19
 ------------------
 
@@ -16,7 +38,7 @@ Changelog
 
 **Other changes:**
 
-- Require Python>=3.9 in line with `NEP 29 <https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table>`_
+- Require Python>=3.9 in line with `NEP 29 <https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table>`_.
 - Build and test with Python 3.12 in CI.
 - Added line search stopping criterion for tiny loss improvements based on gradient information.
 - Added warnings about breaking changes in future versions.
@@ -73,6 +95,7 @@ Changelog
   :class:`~glum.GeneralizedLinearRegressor` and :class:`~glum.GeneralizedLinearRegressorCV`
   to ``'negative.binomial'``.
 
+
 2.4.1 - 2023-03-14
 ------------------
 

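The three strategies for missing categorical values named in the 3.0.0 changelog entry (``"fail"``, ``"zero"``, ``"convert"``) can be illustrated with a small, dependency-free sketch. This is not glum's or tabmat's implementation: the function `encode_categorical`, its signature, and the `"(MISSING)"` column label are assumptions made for illustration only.

```python
import math


def encode_categorical(values, categories, cat_missing_method="fail"):
    """One-hot encode `values` against `categories` (hypothetical helper).

    Missing entries (None or NaN) are handled according to
    `cat_missing_method`, mirroring the three options in the changelog.
    """
    if cat_missing_method not in {"fail", "zero", "convert"}:
        raise ValueError(f"Unknown cat_missing_method: {cat_missing_method!r}")

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    # "fail": refuse to encode a column that contains missing values.
    if cat_missing_method == "fail" and any(is_missing(v) for v in values):
        raise ValueError("Categorical column contains missing values.")

    columns = list(categories)
    if cat_missing_method == "convert":
        # "convert": missing values get their own indicator column.
        # The label is a placeholder chosen for this sketch.
        columns.append("(MISSING)")

    rows = []
    for value in values:
        if is_missing(value):
            # "zero": no column matches, so the row is all zeros.
            key = "(MISSING)" if cat_missing_method == "convert" else None
        elif value not in categories:
            raise ValueError(f"Unseen category: {value!r}")
        else:
            key = value
        rows.append([1 if column == key else 0 for column in columns])
    return columns, rows


columns_zero, rows_zero = encode_categorical(["a", None, "b"], ["a", "b"], "zero")
columns_conv, rows_conv = encode_categorical(["a", None, "b"], ["a", "b"], "convert")
```

With ``"zero"`` the missing row contributes nothing to the linear predictor, whereas ``"convert"`` adds a column and therefore an extra coefficient to estimate.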
diff --git a/README.md b/README.md
@@ -68,7 +68,7 @@ Why did we choose the name `glum`? We wanted a name that had the letters GLM and
 >>>
 >>> _ = model.fit(X=X, y=y)
 >>>
->>> # .report_diagnostics shows details about the steps taken by the iterative solver
+>>> # .report_diagnostics shows details about the steps taken by the iterative solver.
 >>> diags = model.get_formatted_diagnostics(full_report=True)
 >>> diags[['objective_fct']]
         objective_fct
@@ -79,6 +79,15 @@ n_iter
 3            0.443681
 4            0.443498
 5            0.443497
+>>>
+>>> # Models can also be built with formulas from formulaic.
+>>> model_formula = GeneralizedLinearRegressor(
+...     family='binomial',
+...     l1_ratio=1.0,
+...     alpha=0.001,
+...     formula="bedrooms + np.log(bathrooms + 1) + bs(sqft_living, 3) + C(waterfront)"
+... )
+>>> _ = model_formula.fit(X=house_data.data, y=y)
 
 ```
 

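The README example above passes every constructor argument by keyword, which is what the keyword-only change in the 3.0.0 changelog requires. A minimal sketch of how such a signature behaves in plain Python follows. `ToyRegressor` is a hypothetical stand-in, not glum's class, and its trivial `fit` only exists to make the example runnable.

```python
class ToyRegressor:
    """Hypothetical estimator with a keyword-only constructor."""

    def __init__(self, *, alpha=0.0, l1_ratio=0.0, family="normal"):
        # The bare `*` makes every following parameter keyword-only.
        self.alpha = alpha  # default 0: no regularization
        self.l1_ratio = l1_ratio
        self.family = family

    def fit(self, X, y, sample_weight=None, offset=None, *, store_covariance_matrix=False):
        # X, y, sample_weight and offset may still be passed positionally;
        # everything after the bare `*` must be named.
        self.intercept_ = sum(y) / len(y)  # placeholder "model": mean of y
        self.stored_covariance_ = store_covariance_matrix
        return self


model = ToyRegressor(alpha=0.001, l1_ratio=1.0, family="binomial")
model.fit([[0.0], [1.0]], [0.0, 1.0], store_covariance_matrix=True)

try:
    ToyRegressor(0.001)  # positional argument is rejected
    positional_init_ok = True
except TypeError:
    positional_init_ok = False

try:
    model.fit([[0.0], [1.0]], [0.0, 1.0], None, None, True)  # fifth positional is rejected
    positional_fit_ok = True
except TypeError:
    positional_fit_ok = False
```

Code written against an older positional call style would hit the same `TypeError` after upgrading, so naming the arguments is the safe migration path.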
diff --git a/conda.recipe/meta.yaml b/conda.recipe/meta.yaml
@@ -35,7 +35,8 @@ requirements:
     - pandas
     - scikit-learn >=0.23
     - scipy
-    - tabmat >=3.1.0, <4.0.0
+    - formulaic >=0.6
+    - tabmat >=4.0.0
 
 test:
   requires: