DOC Example on model selection for Gaussian Mixture Models

### Describe the issue linked to the documentation

We have an example that illustrates how to use the BIC score to tune the number of components and the type of covariance matrix parametrization here:

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html

However, the BIC score is not meant to be computed in a CV loop; it should be computed directly on the training set, so we should not use it with a `GridSearchCV` call. Indeed, the BIC score already penalizes the number of parameters as a function of the number of data points in the training set.
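
To make the penalty explicit, here is a minimal sketch that recomputes the BIC by hand for a "full" covariance model and compares it with `GaussianMixture.bic`. It assumes the usual definition, `-2 * log-likelihood + n_params * log(n_samples)`; the toy data and the hand-written parameter count are my own illustration, not part of the existing example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data for illustration only (not the data used in the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
n_samples, n_features = X.shape

n_components = 2
gmm = GaussianMixture(
    n_components=n_components, covariance_type="full", random_state=0
).fit(X)

# Free parameters of a "full" covariance GMM (assumed counting convention):
# means + one symmetric covariance matrix per component + mixture weights.
n_params = (
    n_components * n_features
    + n_components * n_features * (n_features + 1) // 2
    + n_components - 1
)

# `score` returns the per-sample average log-likelihood, so rescale it.
log_likelihood = gmm.score(X) * n_samples

# The penalty term grows with log(n_samples): it depends on the training set size.
bic_manual = -2 * log_likelihood + n_params * np.log(n_samples)

print(bic_manual, gmm.bic(X))
```

Because the penalty is tied to the size of the data the model was fitted on, evaluating it on CV test folds mixes two different ways of guarding against overfitting.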

Instead, we should run `GridSearchCV` with the default `.score` method of the GMM estimator, which computes the per-sample average log-likelihood and is a perfectly fine metric for selecting the best model on held-out data in a CV loop.

Note that we can keep computing the BIC score for all the hyperparameter combinations, but we should do it in a single for loop (without a train-test split), e.g.:

```python
from sklearn.model_selection import ParameterGrid
from sklearn.mixture import GaussianMixture
import pandas as pd
import numpy as np


n_samples = 500
rng = np.random.default_rng(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
component_1 = rng.normal(size=(n_samples, 2)) @ C  # general
component_2 = 0.7 * rng.normal(size=(n_samples, 2)) + np.array([-4, 1])  # spherical
X = np.concatenate([component_1, component_2])

param_grid = {
    "n_components": np.arange(1, 7),
    "covariance_type": ["full", "tied", "diag", "spherical"],
}

# Fit each candidate on the full dataset (no train-test split) and record its BIC.
bic_evaluations = []
for params in ParameterGrid(param_grid):
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})

# Lower BIC is better, so sort in ascending order.
bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
bic_evaluations.head()
```

So in summary, I would recommend that we:

- update the existing `GridSearchCV` code to use the `scoring=None` default, which relies on the built-in log-likelihood based model evaluation (averaged over the test sets of the CV loop);
- add a new section for BIC-based model selection with the for loop I proposed above as a computationally cheaper alternative to CV-based model selection with the log-likelihood.

We can then check that the two methods select the same model: 2 components with "full" (non-spherical) parametrization of the covariance matrix.
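
For the first recommendation, a rough sketch of the CV-based selection could look like the following. It regenerates the same data as the BIC loop so that it runs on its own. The shuffled `KFold` and the fixed `random_state` values are my own assumptions (the rows of `X` are ordered by component, so unshuffled folds would be unbalanced), not something the current example does.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, KFold

# Same synthetic data as in the BIC loop.
n_samples = 500
rng = np.random.default_rng(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
component_1 = rng.normal(size=(n_samples, 2)) @ C  # general
component_2 = 0.7 * rng.normal(size=(n_samples, 2)) + np.array([-4, 1])  # spherical
X = np.concatenate([component_1, component_2])

param_grid = {
    "n_components": list(range(1, 7)),
    "covariance_type": ["full", "tied", "diag", "spherical"],
}

# scoring=None (the default) falls back to GaussianMixture.score, i.e. the
# per-sample average log-likelihood, evaluated on each held-out fold.
search = GridSearchCV(
    GaussianMixture(random_state=0),  # random_state is an assumption, for reproducibility
    param_grid=param_grid,
    scoring=None,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # assumption: shuffle ordered rows
)
search.fit(X)

# Higher held-out log-likelihood is better, so sort in descending order.
cv_evaluations = pd.DataFrame(search.cv_results_)[
    ["param_n_components", "param_covariance_type", "mean_test_score"]
].sort_values("mean_test_score", ascending=False)

print(search.best_params_)
print(cv_evaluations.head())
```

Comparing `search.best_params_` with the first row of the sorted `bic_evaluations` table is then enough to check whether the two approaches agree.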

