glum v3.0 (#677)
* Make tests green with densematrix-refactor branch

* Remove most Matrixbase subclass checks

* Simplify _group_sum

* Pre-commit autoupdate (#672)

* Use boa in CI. (#673)

* Fix covariance matrix mutating feature names (#671)

* Do not use _set_up_... in covariance_matrix

* Add changelog entry

* Add the option to store the covariance matrix to avoid recomputing it (#661)

* Add option to store covariance matrix during fit

* Fix fitting with variance matrix estimation

`.covariance_matrix()` expects X and weights in a different format than
what we have at the end of `.fit()`.

* Store covariance matrix after estimation

* Handle the alpha_search and glm_cv cases

* Propagate covariance parameters

* Add changelog

* Slightly more lenient tests

* Pre-commit autoupdate (#676)

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>

* Fix covariance_matrix dtypes

* Make CI use pre-release tabmat

* Column names à la Tabmat #278 (#678)

* Delegate column naming to tabmat

* Add tests

* More tests

* Test for dropping complete categories

* Add docstrings for new argument

* Add changelog entry

* Convert to pandas at the correct place

* Reorganize converting from pandas

* Remove xfail from test

* Formula interface (#670)

* Add formulaic to dependencies

* Add function for transforming the formula

* Add tests

* First draft of glum formula interface

* Fixes and tests

* Handle intercept correctly

* Add formula functionality to glm_cv

* Variables from local context

* Test predict with formulas

* Add formula tutorial

* Fix tutorial

* Reformat tutorial

* Improve function signatures and docstrings

* Handle two-sided formulas in covariance_matrix

* Make mypy happy about module names

* Matthias' suggestions

* Improve tutorial

* Improve tutorial

* Formula- and term-based Wald-tests (#689)

* Add formulaic to dependencies

* Add function for transforming the formula

* Add tests

* First draft of glum formula interface

* Fixes and tests

* Handle intercept correctly

* Add formula functionality to glm_cv

* Variables from local context

* Test predict with formulas

* Add formula tutorial

* Fix tutorial

* Reformat tutorial

* Improve function signatures and docstrings

* Handle two-sided formulas in covariance_matrix

* Make mypy happy about module names

* Matthias' suggestions

* Add back term-based Wald-tests

* Tests for term names

* Add formula-based Wald-test

* Tests for formula-based Wald-test

* Add changelog

* Fix exception message

* Additional test case

* make docstrings clearer in the case of terms

* Support for missing values in categorical columns (#684)

* Delegate column naming to tabmat

* Add tests

* More tests

* Test for dropping complete categories

* Add docstrings for new argument

* Add changelog entry

* Convert to pandas at the correct place

* Reorganize converting from pandas

* Remove xfail from test

* Implement missing categorical support

* Add test

* Solve adding missing category when predicting

* Apply Matthias' suggestions

* Add changelog entry

* Fix formula context (#691)

* Make tests fail

* Propagate context through methods

* pyupgrade

* ensure_full_rank != drop_first

* fix

* move feature name assignment to right spot

* fix

* remove blank line

* bump minimum formulaic version (stateful transforms)

* improve backward compatibility

* Remove code that is not needed in tabmat v4 / glum v3 (#741)

* Remove check_array from predict()

We don't need it here because predict calls linear_predictor, which already performs this check, so we avoid doing it twice.

* Remove _name_categorical_variable parts

There is no need for those as Tabmat v4 handles variable names internally.

---------

Co-authored-by: Martin Stancsics <[email protected]>

* Fix formula test: consider presence of intercept in full rankness check when constructing the model matrix externally (#746)

* deal with intercept in formula test correctly

* naming [skip ci]

* test varying significance level in coef table test (#749)

* pin formulaic to 0.6 (#752)

* Add illustration of formula interface to example in README (#751)

* add illustration of formula to readme

* rephrase

* spacing

* add linear term for illustration

* Determine presence of intercept only by `fit_intercept` argument (#747)

* always use self.fit_intercept; raise if formula conflicts with it

* wording [skip ci]

* adjust other tests, cosmetics

* don't compare specs with singular matrix to smf

* fix smf test formula

* fix intercept in context test

* remove outdated sentence; clean up

* fix

* adjust tutorial

* adjust tutorial

* consistent linebreaks in docstring

* remove obsolete arg in docstring

* Informative error when encountering categories that were not seen in training (#748)

* drop missings not seen in training

* zero not drop

* better (?) name [skip ci]

* catch case of unseen missings and fail method

* fix

* respect categorical missing method with formula; test different categorical missing methods also with formula

* shorten the tests

* dont allow fitting in case of conversion of categoricals and presence of formula

* clearer error msg

* also change the error msg in the regex (facepalm)

* remove matches

* fix

* better name

* describe more restrictive behavior in tutorial

* Raise error on unseen levels when predicting

* Allow cat_missing_method='convert' again

* Update test

* Check for unseen categories

* Adapt align_df_categories tests to changes

* Make pre-commit happy

* Avoid unnecessary work

* Correctly expand penalties with categoricals and `cat_missing_method="convert"` (#753)

* Correctly expand penalties when cat_missing_method=convert

* Add test

* Improve variable names

Co-authored-by: Matthias Schmidtblaicher <[email protected]>

---------

Co-authored-by: Matthias Schmidtblaicher <[email protected]>

* bump tabmat pre-release version

---------

Co-authored-by: Martin Stancsics <[email protected]>

* docstring cosmetics

* even more docstring cosmetics

* Do not fail when an estimator misses class members that are new in v3 (#757)

* do not fail on missing class members that are new in v3

* simplify

* convert

* shorten the comment

* simplify

* don't use getattr unnecessarily

* cosmetics

* fix unrelated typo

* tiny cosmetics [skip ci]

* No regularization as default (#758)

* set alpha=0 as default

* fix docstring

* add alpha where needed to avoid LinAlgError

* add changelog entry

* also set alpha in golden master

* change name in persisted file too

* set alpha in model_parameters again

* don't modify case of no alpha attribute, which is RegressorCV

* remove invalid alpha argument

* wording

* Improve code readability

* Make arguments to public methods except `X`, `y`, `sample_weight` and `offset` keyword-only and make initialization keyword-only (#764)

* make all args except X, y, sample_weight, offset keyword only; make initialization keyword only

* add changelog [skip ci]

* mention that also RegressorBase was changed [skip ci]

* fix import

* clean up changelog

* Restructure distributions (#768)

* Explain `scale_predictors` more (#778)

* Expand on effect of scale_predictors and remove note

* Update src/glum/_glm.py

Co-authored-by: Jan Tilly <[email protected]>

* remove sentence

---------

Co-authored-by: Jan Tilly <[email protected]>

* Move helpers into `_utils` (#782)

* Patch docstring

* Update CHANGELOG.rst

Co-authored-by: Luca Bittarello <[email protected]>

* Apply suggestions from code review

Co-authored-by: Luca Bittarello <[email protected]>

* shorten docstrings of private functions; typos in defaults; other suggestions

* context docstring

* kwargs

* no context as default; small cleanups

* add explanation to get calling scope

* adjust to tabmat release

* keep whitespace

* temporarily add tabmat_dev channel again to investigate env solving failure on CI

* remove tabmat_dev channel again

* for now, disable conda build test on osx and Python 3.12

* Add a different environment for macos (#786)

* try solving on ci with different env for macos

* add missing if

* typo

* try and remove --no-test flag

* replace deprecated scipy.sparse.*_matrix.A

* replace other instance of .A

* two more

* simply replace all instances of .A by .toarray() (tabmat knows both)

* update CHANGELOG for release

---------

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: Jan Tilly <[email protected]>
Co-authored-by: Marc-Antoine Schmidt <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Martin Stancsics <[email protected]>
Co-authored-by: Luca Bittarello <[email protected]>
Co-authored-by: lbittarello <[email protected]>
9 people authored Apr 27, 2024
1 parent 954abc6 commit 653b419
Showing 26 changed files with 4,629 additions and 1,390 deletions.
12 changes: 11 additions & 1 deletion .github/workflows/ci.yml
@@ -40,14 +40,24 @@ jobs:
steps:
- name: Checkout branch
uses: actions/checkout@v4
- name: Set up conda env
- name: Set up conda env (windows and ubuntu)
if: matrix.os != 'macos-latest'
uses: mamba-org/setup-micromamba@422500192359a097648154e8db4e39bdb6c6eed7
with:
environment-file: environment.yml
init-shell: ${{ matrix.os == 'windows-latest' && 'powershell' || 'bash' }}
cache-environment: true
create-args: >-
python=${{ matrix.python-version }}
- name: Set up conda env (macos)
if: matrix.os == 'macos-latest'
uses: mamba-org/setup-micromamba@422500192359a097648154e8db4e39bdb6c6eed7
with:
environment-file: environment-macos.yml
init-shell: bash
cache-environment: true
create-args: >-
python=${{ matrix.python-version }}
- name: Install repository (unix)
if: matrix.os != 'windows-latest'
shell: bash -el {0}
2 changes: 1 addition & 1 deletion .github/workflows/conda-build.yml
@@ -24,7 +24,7 @@ jobs:
- { conda_build_yml: linux_64_python3.12.____cpython, os: ubuntu-latest, conda-build-args: '' }
- { conda_build_yml: osx_64_python3.9.____cpython, os: macos-latest, conda-build-args: '' }
- { conda_build_yml: osx_64_python3.12.____cpython, os: macos-latest, conda-build-args: '' }
- { conda_build_yml: osx_arm64_python3.10.____cpython, os: macos-latest, conda-build-args: ' --no-test' }
- { conda_build_yml: osx_arm64_python3.10.____cpython, os: macos-latest, conda-build-args: '' }
- { conda_build_yml: win_64_python3.9.____cpython, os: windows-latest, conda-build-args: '' }
- { conda_build_yml: win_64_python3.12.____cpython, os: windows-latest, conda-build-args: '' }
steps:
25 changes: 24 additions & 1 deletion CHANGELOG.rst
@@ -7,6 +7,28 @@
Changelog
=========

3.0.0 - 2024-04-27
------------------

**Breaking changes:**

- All arguments to :class:`~glum.GeneralizedLinearRegressorBase`, :class:`~glum.GeneralizedLinearRegressor` and :class:`~glum.GeneralizedLinearRegressorCV` are now keyword-only.
- All arguments to public methods of :class:`~glum.GeneralizedLinearRegressorBase`, :class:`~glum.GeneralizedLinearRegressor` or :class:`~glum.GeneralizedLinearRegressorCV` except ``X``, ``y``, ``sample_weight`` and ``offset`` are now keyword-only.
- :class:`~glum.GeneralizedLinearRegressor`'s default value for ``alpha`` is now ``0``, i.e. no regularization.
- :class:`~glum.GammaDistribution`, :class:`~glum.InverseGaussianDistribution`, :class:`~glum.NormalDistribution` and :class:`~glum.PoissonDistribution` no longer inherit from :class:`~glum.TweedieDistribution`.
- The power parameter of :class:`~glum.TweedieLink` has been renamed from ``p`` to ``power``, in line with :class:`~glum.TweedieDistribution`.
- :class:`~glum.TweedieLink` no longer instantiates :class:`~glum.IdentityLink` or :class:`~glum.LogLink` for ``power=0`` and ``power=1``, respectively. On the other hand, :class:`~glum.TweedieLink` is now compatible with ``power=0`` and ``power=1``.
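
The first three breaking changes can be illustrated with a minimal sketch. ``SketchRegressor`` below is a hypothetical stand-in, not glum's actual class, and ``store_covariance_matrix`` is used only as an example of a keyword-only method argument. The sketch shows how a bare ``*`` in a signature forces keyword arguments while leaving ``X``, ``y``, ``sample_weight`` and ``offset`` positional, with ``alpha`` defaulting to ``0``.

```python
class SketchRegressor:
    """Hypothetical stand-in for an estimator; not glum's actual class."""

    def __init__(self, *, alpha=0, l1_ratio=0, family="normal"):
        # The bare `*` makes every constructor argument keyword-only.
        # alpha=0 mirrors the new default of no regularization.
        self.alpha = alpha
        self.l1_ratio = l1_ratio
        self.family = family

    def fit(self, X, y, sample_weight=None, offset=None, *,
            store_covariance_matrix=False):
        # X, y, sample_weight and offset may still be passed positionally;
        # everything after `*` must be passed by name.
        self.store_covariance_matrix_ = store_covariance_matrix
        return self


# Keyword arguments work as before.
model = SketchRegressor(alpha=0.001, l1_ratio=1.0)
model.fit([[1.0], [2.0]], [0.0, 1.0])

# A positional constructor argument now raises a TypeError.
positional_error = None
try:
    SketchRegressor(0.001)
except TypeError as exc:
    positional_error = str(exc)
```

Code written against 2.x that passes constructor arguments positionally, or that relied on the previous non-zero default for ``alpha``, would need to name its arguments and set ``alpha`` explicitly.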

**New features:**

- Added a formula interface for specifying models.
- Improved feature name handling. Feature names are now created for non-pandas input matrices too. Furthermore, the format of categorical features can be specified by the user.
- Term names are now stored in the model's attributes. This is useful for categorical features, where they refer to the whole variable, not just single levels.
- Added more options for treating missing values in categorical columns. They can either raise a ``ValueError`` (``"fail"``), be treated as all-zero indicators (``"zero"``) or represented as a new category (``"convert"``).
- :meth:`~glum.GeneralizedLinearRegressor.wald_test` can now perform tests based on a formula string and term names.
- :class:`~glum.InverseGaussianDistribution` gains a :meth:`~glum.InverseGaussianDistribution.log_likelihood` method.
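
The three treatments of missing categorical values listed above can be sketched in plain Python. This is an illustration under stated assumptions, not glum's or tabmat's implementation: ``encode_categorical`` is a hypothetical helper, ``None`` stands in for a missing value, and the ``"(MISSING)"`` label for the extra level is assumed for the example.

```python
def encode_categorical(values, categories, cat_missing_method="fail"):
    """One-hot encode `values`, treating None as a missing value.

    Illustrative only; not glum's or tabmat's implementation.
    """
    if cat_missing_method not in ("fail", "zero", "convert"):
        raise ValueError(f"Unknown method: {cat_missing_method!r}")

    has_missing = any(v is None for v in values)
    if has_missing and cat_missing_method == "fail":
        # "fail": refuse to encode a column that contains missing values.
        raise ValueError("Missing values found in categorical column.")

    columns = list(categories)
    if has_missing and cat_missing_method == "convert":
        # "convert": missing values become a level of their own.
        columns.append("(MISSING)")  # hypothetical label

    rows = []
    for v in values:
        if v is None:
            # "zero": no column matches, so the row is all zeros.
            target = "(MISSING)" if cat_missing_method == "convert" else None
        else:
            target = v
        rows.append([1 if col == target else 0 for col in columns])
    return columns, rows


values = ["a", None, "b"]
zero_cols, zero_rows = encode_categorical(values, ["a", "b"], "zero")
conv_cols, conv_rows = encode_categorical(values, ["a", "b"], "convert")
```

With ``"zero"`` the missing row contributes nothing to the linear predictor beyond the other features, whereas ``"convert"`` gives missing values their own coefficient.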


2.7.0 - 2024-02-19
------------------

@@ -16,7 +38,7 @@ Changelog

**Other changes:**

- Require Python>=3.9 in line with `NEP 29 <https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table>`_
- Require Python>=3.9 in line with `NEP 29 <https://numpy.org/neps/nep-0029-deprecation_policy.html#support-table>`_.
- Build and test with Python 3.12 in CI.
- Added line search stopping criterion for tiny loss improvements based on gradient information.
- Added warnings about breaking changes in future versions.
@@ -73,6 +95,7 @@ Changelog
:class:`~glum.GeneralizedLinearRegressor` and :class:`~glum.GeneralizedLinearRegressorCV`
to ``'negative.binomial'``.


2.4.1 - 2023-03-14
------------------

11 changes: 10 additions & 1 deletion README.md
@@ -68,7 +68,7 @@ Why did we choose the name `glum`? We wanted a name that had the letters GLM and
>>>
>>> _ = model.fit(X=X, y=y)
>>>
>>> # .report_diagnostics shows details about the steps taken by the iterative solver
>>> # .report_diagnostics shows details about the steps taken by the iterative solver.
>>> diags = model.get_formatted_diagnostics(full_report=True)
>>> diags[['objective_fct']]
objective_fct
@@ -79,6 +79,15 @@ n_iter
3 0.443681
4 0.443498
5 0.443497
>>>
>>> # Models can also be built with formulas from formulaic.
>>> model_formula = GeneralizedLinearRegressor(
... family='binomial',
... l1_ratio=1.0,
... alpha=0.001,
... formula="bedrooms + np.log(bathrooms + 1) + bs(sqft_living, 3) + C(waterfront)"
... )
>>> _ = model_formula.fit(X=house_data.data, y=y)

```

3 changes: 2 additions & 1 deletion conda.recipe/meta.yaml
@@ -35,7 +35,8 @@ requirements:
- pandas
- scikit-learn >=0.23
- scipy
- tabmat >=3.1.0, <4.0.0
- formulaic >=0.6
- tabmat >=4.0.0

test:
requires: