
Informative error when encountering categories that were not seen in training #748

Merged
merged 27 commits into glum-v3 from convert-nas-unseen
Jan 29, 2024

Conversation

MatthiasSchmidtblaicherQC
Contributor

@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC commented Jan 10, 2024

Categories that were not seen in training, including NAs, should lead to an informative error. Currently, the error messages are unclear (see below). There should be tests for the behavior with unseen categories.

Old description, prior to discussion below:
Conversion of missing categoricals into their own categories (cat_missing_method=="convert") if the missings have not been observed in training is currently not handled well. This PR addresses this for two different cases:

  • In the standard model without a formula, we drop missing categories that were not seen in training, just as we would for any other category. Currently, new categories are created at prediction time for these, leading to a failure in prediction.
  • In the case that the model is built with a formula, we don't allow cat_missing_method=="convert". If there is real need for this feature, we can add it in the future by changing the TabmatMaterializer to align categories between training and prediction.
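The category-alignment idea in the first bullet can be sketched in plain pandas (a hypothetical helper for illustration, not glum's actual implementation):

```python
import pandas as pd

def align_to_training_dtype(col: pd.Series, train_dtype: pd.CategoricalDtype) -> pd.Series:
    # Casting to the training-time dtype recodes the column: levels that
    # were not seen in training become NaN, mirroring how an unseen
    # non-missing category is dropped.
    return col.astype(train_dtype)

train = pd.Series(pd.Categorical(["a", "b", "a"]))
test = pd.Series(pd.Categorical(["a", "c"]))  # "c" was never seen in training

aligned = align_to_training_dtype(test, train.dtype)
print(aligned.tolist())  # ['a', nan]
```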

@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC changed the title cat_missing_method == "convert": drop missing category if not seen in training cat_missing_method == "convert": drop missing category that was not seen in training Jan 10, 2024
@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC changed the title cat_missing_method == "convert": drop missing category that was not seen in training Special cases with categorical missing method Jan 11, 2024
@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC changed the title Special cases with categorical missing method Missing categoricals that were not seen in training Jan 11, 2024
@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC marked this pull request as ready for review January 11, 2024 16:07
@stanmart
Collaborator

Thanks a lot for handling this annoying case. I agree that the current way of handling missings that are absent from the training data is less than ideal. I generally like this solution, with two caveats.

  • I can get behind treating missings as all-zero dummies when the training data does not contain missings, but I'm not sure that is what currently (glum 2.6) happens for any other, non-missing category. I think prediction actually fails in that case, which is closer to what happens in the original code (although this case is not handled explicitly, so the error message is probably too vague). I might be misunderstanding it, though; please see the example code below.
In [1]: import glum
In [2]: import pandas as pd

In [3]: df_train = pd.DataFrame({
    ...:     "x": pd.Categorical(["a", "b", "a", "b"]),
    ...:     "y": [1., 2., 3., 4.],
    ...: })
In [4]: df_test = pd.DataFrame({"x": pd.Categorical(["a", "b", "c"])})

In [5]: model.predict(df_test[:2])
Out[5]: array([2.33333333, 2.66666667])

In [6]: model.predict(df_test)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[23], line 1
----> 1 model.predict(df_test)

File ~/micromamba/envs/glum/lib/python3.12/site-packages/glum/_glm.py:1317, in GeneralizedLinearRegressorBase.predict(self, X, sample_weight, offset, alpha_index, alpha)
   1314 if isinstance(X, pd.DataFrame) and hasattr(self, "feature_dtypes_"):
   1315     X = _align_df_categories(X, self.feature_dtypes_)
-> 1317 X = check_array_tabmat_compliant(
   1318     X,
   1319     accept_sparse=["csr", "csc", "coo"],
   1320     dtype="numeric",
   1321     copy=self._should_copy_X(),
   1322     ensure_2d=True,
   1323     allow_nd=False,
   1324     drop_first=self.drop_first,
   1325 )
   1326 eta = self.linear_predictor(
   1327     X, offset=offset, alpha_index=alpha_index, alpha=alpha
   1328 )
   1329 mu = get_link(self.link, get_family(self.family)).inverse(eta)

File ~/micromamba/envs/glum/lib/python3.12/site-packages/glum/_glm.py:96, in check_array_tabmat_compliant(mat, drop_first, **kwargs)
     93 to_copy = kwargs.get("copy", False)
     95 if isinstance(mat, pd.DataFrame) and any(mat.dtypes == "category"):
---> 96     mat = tm.from_pandas(mat, drop_first=drop_first)
     98 if isinstance(mat, tm.SplitMatrix):
     99     kwargs.update({"ensure_min_features": 0})

File ~/micromamba/envs/glum/lib/python3.12/site-packages/tabmat/constructor.py:75, in from_pandas(df, dtype, sparse_threshold, cat_threshold, object_as_cat, cat_position, drop_first)
     73     coldata = coldata.astype("category")
     74 if isinstance(coldata.dtype, pd.CategoricalDtype):
---> 75     cat = CategoricalMatrix(coldata, drop_first=drop_first, dtype=dtype)
     76     if len(coldata.cat.categories) < cat_threshold:
     77         (
     78             X_dense_F,
     79             X_sparse,
   (...)
     84             threshold=sparse_threshold,
     85         )

File ~/micromamba/envs/glum/lib/python3.12/site-packages/tabmat/categorical_matrix.py:255, in CategoricalMatrix.__init__(self, cat_vec, drop_first, dtype)
    248 def __init__(
    249     self,
    250     cat_vec: Union[list, np.ndarray, pd.Categorical],
    251     drop_first: bool = False,
    252     dtype: np.dtype = np.float64,
    253 ):
    254     if pd.isnull(cat_vec).any():
--> 255         raise ValueError("Categorical data can't have missing values.")
    257     if isinstance(cat_vec, pd.Categorical):
    258         self.cat = cat_vec

ValueError: Categorical data can't have missing values.
  • I'm also okay with simply forbidding cat_missing_method="convert" when formulas are used, as people can do the conversion themselves as a pre-processing step. However, the current implementation only raises an exception when it is set outside of the formula, and not when it is set within a C() expression. I'm not really sure what the best way to go about it is, as parsing the formulas within glum is probably not a good way to go. OTOH, I would not forbid cat_missing_method="convert" in tabmat's formula interface, as it might be useful there, and there is no training/prediction problem. Maybe it's okay to leave it as it is, as people using complex C() functions in formulas can be expected to know what they are doing?

What do you think?

@MatthiasSchmidtblaicherQC
Contributor Author

MatthiasSchmidtblaicherQC commented Jan 22, 2024

I'm not sure it is what currently (glum 2.6) happens to any other, non-missing category

I am getting the behavior that I describe with 2.6.0, see below. One of us must be running this on the wrong version, please double check, it could well be me :).

# %%
import pandas as pd
from importlib.metadata import version

from glum import GeneralizedLinearRegressor

# %%
df_test = pd.DataFrame({"x": pd.Categorical(["a", "b", "c"])})
df_train = pd.DataFrame({
    "x": pd.Categorical(["a", "b", "a", "b"]),
    "y": [1., 2., 3., 4.],
})
# %%
version('glum')  # 2.6.0
# %%
model = GeneralizedLinearRegressor().fit(df_train[["x"]], df_train["y"])
model.predict(df_test)  # array([2.33333333, 2.66666667, 2.5       ])
# %%
model.intercept_  # 2.5

Raising an error for unseen categories seems like a good option too, at least that would be consistent with libraries that use dummy-encoding. However, this would be a breaking change if my result holds up. Also, the current 2.6.0 behavior is not unreasonable.
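For comparison, scikit-learn's OneHotEncoder refuses unseen categories by default; a minimal sketch (not glum code):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="error")  # "error" is the default
enc.fit(np.array([["a"], ["b"], ["a"], ["b"]]))

try:
    enc.transform(np.array([["a"], ["b"], ["c"]]))  # "c" was never seen
    err = None
except ValueError as exc:
    err = exc

print(err)
```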

However, the current implementation only raises an exception when it is set outside of the formula, and not when it is set within a C() expression.

Good observation. I agree that parsing the formulas in glum would be too much overhead, so catching the missing method inside C is not desirable. What about allowing for cat_missing_method="convert" with formulas, but raising a more informative error message when the design matrix at prediction does not fit the one at training?

Here is what sklearn's ElasticNet does in the example, as a comparison (run after the code above):

import pytest
from sklearn.linear_model import ElasticNet

model_sklearn = ElasticNet(alpha=0.01)
model_sklearn.fit(pd.get_dummies(df_train[["x"]]), df_train["y"])
# %%
with pytest.raises(ValueError, match="The feature names should match those that were passed during fit"):
    model_sklearn.predict(pd.get_dummies(df_test[["x"]]))

@stanmart
Collaborator

I am getting the behavior that I describe with 2.6.0, see below. One of us must be running this on the wrong version, please double check, it could well be me :).

Okay, that is weird indeed 😅 I'm getting the same ValueError with your code snippet too, under glum 2.6.0, 3.0.0a0, and 3.0.0a2 alike:

>>> import pandas as pd
>>> from importlib.metadata import version

>>> from glum import GeneralizedLinearRegressor

>>> df_test = pd.DataFrame({"x": pd.Categorical(["a", "b", "c"])})
>>> df_train = pd.DataFrame({
    "x": pd.Categorical(["a", "b", "a", "b"]),
    "y": [1., 2., 3., 4.],
})

>>> version("glum")
'2.6.0'

>>> model = GeneralizedLinearRegressor().fit(df_train[["x"]], df_train["y"])
>>> model.predict(df_test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/martin/micromamba/envs/glum/lib/python3.12/site-packages/glum/_glm.py", line 1317, in predict
    X = check_array_tabmat_compliant(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/martin/micromamba/envs/glum/lib/python3.12/site-packages/glum/_glm.py", line 96, in check_array_tabmat_compliant
    mat = tm.from_pandas(mat, drop_first=drop_first)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/martin/micromamba/envs/glum/lib/python3.12/site-packages/tabmat/constructor.py", line 75, in from_pandas
    cat = CategoricalMatrix(coldata, drop_first=drop_first, dtype=dtype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/martin/micromamba/envs/glum/lib/python3.12/site-packages/tabmat/categorical_matrix.py", line 255, in __init__
    raise ValueError("Categorical data can't have missing values.")
ValueError: Categorical data can't have missing values.

Below are some details about my environment. Let's try to figure out what the difference is between our setups. (Maybe we should focus on pandas and tabmat versions as a first try?)

❯ micromamba info

       libmamba version : 1.5.6
     micromamba version : 1.5.6
           curl version : libcurl/8.5.0 OpenSSL/3.2.0 zlib/1.2.13 zstd/1.5.5 libssh2/1.11.0 nghttp2/1.58.0
     libarchive version : libarchive 3.7.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.5
       envs directories : /home/martin/micromamba/envs
          package cache : /home/martin/micromamba/pkgs
                          /home/martin/.mamba/pkgs
            environment : glum (active)
           env location : /home/martin/micromamba/envs/glum
      user config files : /home/martin/.mambarc
 populated config files : /home/martin/.condarc
       virtual packages : __unix=0=0
                          __linux=5.15.133=0
                          __glibc=2.35=0
                          __archspec=1=x86_64-v3
               channels : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
       base environment : /home/martin/micromamba
               platform : linux-64
❯ micromamba list
List of packages in environment: "/home/martin/micromamba/envs/glum"

  Name               Version       Build                Channel
─────────────────────────────────────────────────────────────────────
  _libgcc_mutex      0.1           conda_forge          conda-forge
  _openmp_mutex      4.5           2_gnu                conda-forge
  asttokens          2.4.1         pyhd8ed1ab_0         conda-forge
  bzip2              1.0.8         hd590300_5           conda-forge
  ca-certificates    2023.11.17    hbcca054_0           conda-forge
  decorator          5.1.1         pyhd8ed1ab_0         conda-forge
  exceptiongroup     1.2.0         pyhd8ed1ab_2         conda-forge
  executing          2.0.1         pyhd8ed1ab_0         conda-forge
  glum               2.6.0         py312hfb8ada1_1      conda-forge
  ipython            8.20.0        pyh707e725_0         conda-forge
  jedi               0.19.1        pyhd8ed1ab_0         conda-forge
  joblib             1.3.2         pyhd8ed1ab_0         conda-forge
  ld_impl_linux-64   2.40          h41732ed_0           conda-forge
  libblas            3.9.0         20_linux64_openblas  conda-forge
  libcblas           3.9.0         20_linux64_openblas  conda-forge
  libexpat           2.5.0         hcb278e6_1           conda-forge
  libffi             3.4.2         h7f98852_5           conda-forge
  libgcc-ng          13.2.0        h807b86a_3           conda-forge
  libgfortran-ng     13.2.0        h69a702a_3           conda-forge
  libgfortran5       13.2.0        ha4646dd_3           conda-forge
  libgomp            13.2.0        h807b86a_3           conda-forge
  libjemalloc-local  5.3.0         hcb278e6_0           conda-forge
  liblapack          3.9.0         20_linux64_openblas  conda-forge
  libnsl             2.0.1         hd590300_0           conda-forge
  libopenblas        0.3.25        pthreads_h413a1c8_0  conda-forge
  libsqlite          3.44.2        h2797004_0           conda-forge
  libstdcxx-ng       13.2.0        h7e041cc_3           conda-forge
  libuuid            2.38.1        h0b41bf4_0           conda-forge
  libxcrypt          4.4.36        hd590300_1           conda-forge
  libzlib            1.2.13        hd590300_5           conda-forge
  matplotlib-inline  0.1.6         pyhd8ed1ab_0         conda-forge
  ncurses            6.4           h59595ed_2           conda-forge
  nomkl              1.0           h5ca1d4c_0           conda-forge
  numexpr            2.8.8         py312hed3a10b_100    conda-forge
  numpy              1.26.3        py312heda63a1_0      conda-forge
  openssl            3.2.0         hd590300_1           conda-forge
  pandas             2.2.0         py312hfb8ada1_0      conda-forge
  parso              0.8.3         pyhd8ed1ab_0         conda-forge
  pexpect            4.8.0         pyh1a96a4e_2         conda-forge
  pickleshare        0.7.5         py_1003              conda-forge
  pip                23.3.2        pyhd8ed1ab_0         conda-forge
  prompt-toolkit     3.0.42        pyha770c72_0         conda-forge
  ptyprocess         0.7.0         pyhd3deb0d_0         conda-forge
  pure_eval          0.2.2         pyhd8ed1ab_0         conda-forge
  pygments           2.17.2        pyhd8ed1ab_0         conda-forge
  python             3.12.1        hab00c5b_1_cpython   conda-forge
  python-dateutil    2.8.2         pyhd8ed1ab_0         conda-forge
  python-tzdata      2023.4        pyhd8ed1ab_0         conda-forge
  python_abi         3.12          4_cp312              conda-forge
  pytz               2023.3.post1  pyhd8ed1ab_0         conda-forge
  readline           8.2           h8228510_1           conda-forge
  scikit-learn       1.4.0         py312h394d371_0      conda-forge
  scipy              1.12.0        py312heda63a1_0      conda-forge
  setuptools         69.0.3        pyhd8ed1ab_0         conda-forge
  six                1.16.0        pyh6c4a22f_0         conda-forge
  stack_data         0.6.2         pyhd8ed1ab_0         conda-forge
  tabmat             3.1.13        py312hfb8ada1_0      conda-forge
  threadpoolctl      3.2.0         pyha21a80b_0         conda-forge
  tk                 8.6.13        noxft_h4845f30_101   conda-forge
  traitlets          5.14.1        pyhd8ed1ab_0         conda-forge
  typing_extensions  4.9.0         pyha770c72_0         conda-forge
  tzdata             2023d         h0c530f3_0           conda-forge
  wcwidth            0.2.13        pyhd8ed1ab_0         conda-forge
  wheel              0.42.0        pyhd8ed1ab_0         conda-forge
  xz                 5.2.6         h166bdaf_0           conda-forge

What about allowing for cat_missing_method="convert" with formulas, but raising a more informative error message when the design matrix at prediction does not fit the one at training?

Yes, that sounds like a great solution to me.

@MatthiasSchmidtblaicherQC
Contributor Author

I ran it on tabmat 3.1.10. When I update to 3.1.11, I get the same error as you, so this explains the difference.

I would then change the implementation such that we always raise an informative error when the model matrix at predict differs from train, if you agree. This is in line with what happens under dummy encoding and in other packages.

@stanmart
Collaborator

Yes I think that would be great. I can also take a stab at it if you'd like.

Regarding the change in behavior due to tabmat, I now remember the reason. There was this PR, which was probably released in 3.1.11. The relevant part is the following:

This is an implementation detail and users should notice no change in behavior, except for one edge case: missing values in low-cardinality categoricals. Until now, from_pandas (inheriting the default behavior of pandas.get_dummies) handled it by simply setting all of their indicator columns to 0. After this PR, such columns will raise an error.

I would probably consider the old behavior as a bug, and therefore this PR as a non-breaking change for the following reasons:

  • It was not documented anywhere.
  • High-cardinality categoricals already raise an error as CategoricalMatrix does not handle missing values. Accepting them for low-cardinality categoricals is inconsistent and surprising behavior.
  • Silently creating all-zero indicators for categoricals is not the (only) obvious way to go.
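The old get_dummies default described in the quoted passage is easy to see directly; a quick illustration:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", None]))
dummies = pd.get_dummies(s)  # dummy_na=False is the default

# Each non-missing row has exactly one indicator set; the NA row has none.
print(dummies.sum(axis=1).tolist())  # [1, 1, 0]
```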

Anyways, mystery solved 🙂

@MatthiasSchmidtblaicherQC
Contributor Author

MatthiasSchmidtblaicherQC commented Jan 23, 2024

Yes, great that it is solved. :) The text you cite is a further argument for raising an error on new categories at predict time.

I can also take a stab at it if you'd like.

That would be great, thanks!

@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC changed the title Missing categoricals that were not seen in training Categoricals that were not seen in training Jan 23, 2024
@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC changed the title Categoricals that were not seen in training Categories that were not seen in training Jan 23, 2024
@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC changed the title Categories that were not seen in training Informative error when encountering categories that were not seen in training Jan 23, 2024
@stanmart
Collaborator

stanmart commented Jan 24, 2024

It's somewhat trickier than I expected because we have to keep track of whether categoricals had NAs at training time when the missing method is convert. Therefore, in addition to storing the dtype dict, I add a new dict recording whether each column has a missing category. (It turns out that this is useful even beyond this specific issue. As things stand now, the expansion of penalties is incorrect when categorical missings are treated as separate categories. It is a small fix, for which I can submit a PR soon, but it also requires keeping track of which categoricals had missings at training time.)
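A rough sketch of that bookkeeping (hypothetical helper with made-up names, simplified from what the PR actually does):

```python
import pandas as pd

def record_categorical_info(df: pd.DataFrame) -> tuple[dict, dict]:
    # At fit time, remember each column's dtype and, for categorical
    # columns only, whether the column contained missing values.
    dtypes = df.dtypes.to_dict()
    has_missing = {
        col: bool(df[col].isna().any())
        for col in df.columns
        if isinstance(df[col].dtype, pd.CategoricalDtype)
    }
    return dtypes, has_missing

df_train = pd.DataFrame({
    "x": pd.Categorical(["a", None, "a"]),
    "y": [1.0, 2.0, 3.0],
})
dtypes, has_missing = record_categorical_info(df_train)
print(has_missing)  # {'x': True}
```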

Checking for unseen categories now happens in the _align_df_categories function for non-formula-based models. For formulas, I think it would be better handled in tabmat, and smaller changes would be required there, so I'd rather submit a separate PR.
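An illustrative, simplified version of such a check (the real logic in _align_df_categories is more involved):

```python
import pandas as pd

def check_unseen_categories(df: pd.DataFrame, train_dtypes: dict) -> None:
    # Raise an informative error if a categorical column contains levels
    # that were not present at training time.
    for col, dtype in train_dtypes.items():
        if not isinstance(dtype, pd.CategoricalDtype):
            continue
        unseen = set(df[col].dropna().unique()) - set(dtype.categories)
        if unseen:
            raise ValueError(
                f"Column {col!r} contains unseen categories: {sorted(unseen)}. "
                f"Categories seen in training: {list(dtype.categories)}."
            )

train_dtypes = {"x": pd.CategoricalDtype(["a", "b"])}
df_test = pd.DataFrame({"x": pd.Categorical(["a", "b", "c"])})

try:
    check_unseen_categories(df_test, train_dtypes)
    err = None
except ValueError as exc:
    err = exc

print(err)
```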

Let me know what you think. It's a bit more added complexity than what I was hoping for, but I think that checking for unseen categories is important now that NAs are allowed in categoricals.

Edit: test is failing because of the aforementioned upcoming tabmat changes. They are green with the WIP version of tabmat.

@MatthiasSchmidtblaicherQC
Contributor Author

MatthiasSchmidtblaicherQC commented Jan 24, 2024

Thanks. This is how I imagined it, but better executed! I don't have additional comments beyond those in the tabmat PR.

test is failing because of the aforementioned upcoming tabmat changes. They are green with the WIP version of tabmat.

Sure, will make a new pre-release of tabmat once the branch is merged in tabmat-v4.

the expansion of penalties is incorrect when categorical missings are treated as separate categories. It is a small fix for which I can submit a PR soon.

Good catch, looking forward to the PR.

stanmart and others added 2 commits January 29, 2024 14:35
…"convert"` (#753)

* Correctly expand penalties when cat_missing_method=convert

* Add test

* Improve variable names

Co-authored-by: Matthias Schmidtblaicher <[email protected]>

---------

Co-authored-by: Matthias Schmidtblaicher <[email protected]>
@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC merged commit 1ad8be2 into glum-v3 Jan 29, 2024
14 checks passed
@MatthiasSchmidtblaicherQC MatthiasSchmidtblaicherQC deleted the convert-nas-unseen branch January 29, 2024 14:43
MatthiasSchmidtblaicherQC added a commit that referenced this pull request Apr 27, 2024
* Make tests green with densematrix-refactor branch

* Remove most Matrixbase subclass checks

* Simplify _group_sum

* Pre-commit autoupdate (#672)

* Use boa in CI. (#673)

* Fix covariance matrix mutating feature names (#671)

* Do not use _set_up_... in covariance_matrix

* Add changelog entry

* Add the option to store the covariance matrix to avoid recomputing it (#661)

* Add option to store covariance matrix during fit

* Fix fitting with variance matrix estimation

`.covariance_matrix()` expects X and weights in a different format than
what we have at the end of `.fit().

* Store covariance matrix after estimation

* Handle the alpha_search and glm_cv cases

* Propagate covariance parameters

* Add changelog

* Slightly more lenient tests

* Pre-commit autoupdate (#676)

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>

* Fix covariance_matrix dtypes

* Make CI use pre-release tabmat

* Column names à la Tabmat #278 (#678)

* Delegate column naming to tabmat

* Add tests

* More tests

* Test for dropping complete categories

* Add docstrings for new argument

* Add changelog entry

* Convert to pandas at the correct place

* Reorganize converting from pandas

* Remove xfail from test

* Formula interface (#670)

* Add formulaic to dependencies

* Add function for transforming the formula

* Add tests

* First draft of glum formula interface

* Fixes and tests

* Handle intercept correctly

* Add formula functionality to glm_cv

* Variables from local context

* Test predict with formulas

* Add formula tutorial

* Fix tutorial

* Reformat tutorial

* Improve function signatures and docstrings

* Handle two-sided formulas in covariance_matrix

* Make mypy happy about module names

* Matthias' suggestions

* Improve tutorial

* Improve tutorial

* Formula- and term-based Wald-tests (#689)

* Add formulaic to dependencies

* Add function for transforming the formula

* Add tests

* First draft of glum formula interface

* Fixes and tests

* Handle intercept correctly

* Add formula functionality to glm_cv

* Variables from local context

* Test predict with formulas

* Add formula tutorial

* Fix tutorial

* Reformat tutorial

* Improve function signatures and docstrings

* Handle two-sided formulas in covariance_matrix

* Make mypy happy about module names

* Matthias' suggestions

* Add back term-based Wald-tests

* Tests for term names

* Add formula-based Wald-test

* Tests for formula-based Wald-test

* Add changelog

* Fix exception message

* Additional test case

* make docstrings clearer in the case of terms

* Support for missing values in categorical columns (#684)

* Delegate column naming to tabmat

* Add tests

* More tests

* Test for dropping complete categories

* Add docstrings for new argument

* Add changelog entry

* Convert to pandas at the correct place

* Reorganize converting from pandas

* Remove xfail from test

* Implement missing categorical support

* Add test

* Solve adding missing category when predicting

* Apply Matthias' suggestions

* Add changelog entry

* Fix formula context (#691)

* Make tests fail

* Propagate context through methods

* pyupgrade

* ensure_full_rank != drop_first

* fix

* move feature name assignment to right spot

* fix

* remove blank line

* bump minimum formulaic version (stateful transforms)

* improve backward compatibility

* Remove code that is not needed in tabmat v4 / glum v3 (#741)

* Remove check_array from predict()

We don't need it here as predict calls linear_predictor, and the latter does this check. We can avoid doing it twice.

* Remove _name_categorical_variable parts

There is no need for those as Tabmat v4 handles variable names internally.

---------

Co-authored-by: Martin Stancsics <[email protected]>

* Fix formula test: consider presence of intercept in full rankness check when constructing the model matrix externally (#746)

* deal with intercept in formula test correctly

* naming [skip ci]

* test varying significance level in coef table test (#749)

* pin formulaic to 0.6 (#752)

* Add illustration of formula interface to example in README (#751)

* add illustration of formula to readme

* rephrase

* spacing

* add linear term for illustration

* Determine presence of intercept only by `fit_intercept` argument (#747)

* always use self.fit_intercept; raise if formula conflicts with it

* wording [skip ci]

* adjust other tests, cosmetics

* don't compare specs with singular matrix to smf

* fix smf test formula

* fix intercept in context test

* remove outdated sentence; clean up

* fix

* adjust tutorial

* adjust tutorial

* consistent linebreaks in docstring

* remove obsolete arg in docstring

* Informative error when encountering categories that were not seen in training (#748)

* drop missings not seen in training

* zero not drop

* better (?) name [skip ci]

* catch case of unseen missings and fail method

* fix

* respect categorical missing method with formula; test different categorical missing methods also with formula

* shorten the tests

* dont allow fitting in case of conversion of categoricals and presence of formula

* clearer error msg

* also change the error msg in the regex (facepalm)

* remove matches

* fix

* better name

* describe more restrictive behavior in tutorial

* Raise error on unseen levels when predicting

* Allow cat_missing_method='convert' again

* Update test

* Check for unseen categories

* Adapt align_df_categories tests to changes

* Make pre-commit happy

* Avoid unnecessary work

* Correctly expand penalties with categoricals and `cat_missing_method="convert"` (#753)

* Correctly expand penalties when cat_missing_method=convert

* Add test

* Improve variable names

Co-authored-by: Matthias Schmidtblaicher <[email protected]>

---------

Co-authored-by: Matthias Schmidtblaicher <[email protected]>

* bump tabmat pre-release version

---------

Co-authored-by: Martin Stancsics <[email protected]>

* docstring cosmetics

* even more docstring cosmetics

* Do not fail when an estimator misses class members that are new in v3 (#757)

* do not fail on missing class members that are new in v3

* simplify

* convert

* shorten the comment

* simplify

* don't use getattr unnecessarily

* cosmetics

* fix unrelated typo

* tiny cosmetics [skip ci]

* No regularization as default (#758)

* set alpha=0 as default

* fix docstring

* add alpha where needed to avoid LinAlgError

* add changelog entry

* also set alpha in golden master

* change name in persisted file too

* set alpha in model_parameters again

* don't modify case of no alpha attribute, which is RegressorCV

* remove invalid alpha argument

* wording

* Improve code readability

* Make arguments to public methods except `X`, `y`, `sample_weight` and `offset` keyword-only and make initialization keyword-only (#764)

* make all args except X, y, sample_weight, offset keyword only; make initialization keyword only

* add changelog [skip ci]

* mention that also RegressorBase was changed [skip ci]

* fix import

* clean up changelog

* Restructure distributions (#768)

* Explain `scale_predictors` more (#778)

* Expand on effect of scale_predictors and remove note

* Update src/glum/_glm.py

Co-authored-by: Jan Tilly <[email protected]>

* remove sentence

---------

Co-authored-by: Jan Tilly <[email protected]>

* Move helpers into `_utils` (#782)

* Patch docstring

* Update CHANGELOG.rst

Co-authored-by: Luca Bittarello <[email protected]>

* Apply suggestions from code review

Co-authored-by: Luca Bittarello <[email protected]>

* shorten docstrings of private functions; typos in defaults; other suggestions

* context docstring

* kwargs

* no context as default; small cleanups

* add explanation to get calling scope

* adjust to tabmat release

* keep whitespace

* temporarily add tabmat_dev channel again to investigate env solving failure on CI

* remove tabmat_dev channel again

* for now, disable conda build test on osx and Python 3.12

* Add a different environment for macos (#786)

* try solving on ci with different env for macos

* add missing if

* typo

* try and remove --no-test flag

* replace deprecated scipy.sparse.*_matrix.A

* replace other instance of .A

* two more

* simply replace all instances of .A by .toarray() (tabmat knows both)

* update CHANGELOG for release

---------

Co-authored-by: quant-ranger[bot] <132915763+quant-ranger[bot]@users.noreply.github.com>
Co-authored-by: Jan Tilly <[email protected]>
Co-authored-by: Marc-Antoine Schmidt <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Matthias Schmidtblaicher <[email protected]>
Co-authored-by: Martin Stancsics <[email protected]>
Co-authored-by: Luca Bittarello <[email protected]>
Co-authored-by: lbittarello <[email protected]>