
DOC Example on model selection for Gaussian Mixture Models #30323

Open
ogrisel opened this issue Nov 21, 2024 · 9 comments
Comments

@ogrisel
Member

ogrisel commented Nov 21, 2024

Describe the issue linked to the documentation

We have an example that illustrates how to use the BIC score to tune the number of components and the type of covariance matrix parametrization here:

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html

However, the BIC score is not meant to be computed in a CV loop, but directly on the training set, so we should not use it with a GridSearchCV call. Indeed, the BIC score already penalizes the number of parameters as a function of the number of data points in the training set.

Instead, we should call GridSearchCV on the default .score method of the GMM estimator, which computes the mean log-likelihood and is a perfectly fine metric to select the best model on held-out data in a CV loop.
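A minimal sketch of what that could look like, reusing the same synthetic two-component data as elsewhere in this issue (random_state=0 is an arbitrary choice added here for reproducibility, not part of the original proposal):

```python
# Sketch only: CV-based model selection with the estimator's built-in score.
# scoring=None makes GridSearchCV fall back to GaussianMixture.score, i.e.
# the mean per-sample log-likelihood on each held-out fold.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

n_samples = 500
rng = np.random.default_rng(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
X = np.concatenate([
    rng.normal(size=(n_samples, 2)) @ C,                        # general
    0.7 * rng.normal(size=(n_samples, 2)) + np.array([-4, 1]),  # spherical
])

param_grid = {
    "n_components": np.arange(1, 7),
    "covariance_type": ["full", "tied", "diag", "spherical"],
}
grid_search = GridSearchCV(
    GaussianMixture(random_state=0), param_grid=param_grid, scoring=None
)
grid_search.fit(X)
print(grid_search.best_params_)
```

On this data the search should recover the true generative model: 2 components with "full" covariance.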

Note that we can keep computing the BIC score for all the hparam combinations but we should either do it in a single for loop (without train-test split), e.g.:

from sklearn.model_selection import ParameterGrid
from sklearn.mixture import GaussianMixture
import pandas as pd
import numpy as np


n_samples = 500
rng = np.random.default_rng(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
component_1 = rng.normal(size=(n_samples, 2)) @ C  # general
component_2 = 0.7 * rng.normal(size=(n_samples, 2)) + np.array([-4, 1])  # spherical
X = np.concatenate([component_1, component_2])

param_grid = {
    "n_components": np.arange(1, 7),
    "covariance_type": ["full", "tied", "diag", "spherical"],
}

bic_evaluations = []
for params in ParameterGrid(param_grid):
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})

bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
bic_evaluations.head()

So in summary I would recommend to:

  • update the existing GridSearchCV code to use the scoring=None default that would use the built-in log-likelihood based model evaluation (averaged on the test sets of the CV loop);
  • add a new section for BIC-based model selection with the for loop I proposed above as a computationally cheaper alternative to CV-based model selection with the log-likelihood.

We can then check that the two methods select the same model: 2 components with "full" (non-spherical) parametrization of the covariance matrix.

@ogrisel ogrisel added Documentation Needs Triage Issue requires triage labels Nov 21, 2024
@ghost

ghost commented Nov 21, 2024

I think the things you asked for should be implemented because these changes ensure correct usage of model selection methods with GaussianMixture.

I tried the recommendations you mentioned; here is the result:

[screenshot attached: "Screenshot 2024-11-22 at 1 36 00 AM"]

@glemaitre glemaitre removed the Needs Triage Issue requires triage label Nov 21, 2024
@ghost

ghost commented Nov 22, 2024

@glemaitre could you review my comment and let me know whether the changes should be made?

@ogrisel
Member Author

ogrisel commented Nov 22, 2024

@Yuvraj-Pradhan-27: your original comment feels like the output of an LLM paraphrasing the description I wrote in the issue. If that's the case, please refrain from posting such redundant and verbose comments in the future. If that's not the case, then I am sorry ;)

I see that you opened two PRs:

I closed the first one because it felt subsumed by #30329.

When opening PRs, please refer to the issue they are related to in the description to make it easier to discover them and avoid having several people concurrently working on the same issue.

@ghost

ghost commented Nov 22, 2024

I did paraphrase the description you wrote; the point was only to support your recommendations, and I executed the code you mentioned. I am new to GitHub contributions, so I am really sorry if I did something unwanted. I will take care of it from next time.

@ogrisel
Member Author

ogrisel commented Nov 22, 2024

No problem. If you are new to contributing to scikit-learn, please make sure to follow the guidelines from: https://scikit-learn.org/dev/developers/index.html

And welcome to the project!

@ghost

ghost commented Nov 22, 2024

I will go through the document thoroughly, thanks for the help.

@Uvi-12
Contributor

Uvi-12 commented Dec 4, 2024

Data Generation
We generate two components, each containing n_samples, by randomly sampling from the standard normal distribution using numpy.random.randn. One component remains spherical but is shifted and scaled, while the other is deformed to have a more general covariance matrix.

Model Selection without Cross-Validation using the BIC Score
We select the best model by varying the number of components from 1 to 6 and testing different covariance types. The covariance types include:
  • "full": each component has its own general covariance matrix.
  • "tied": all components share the same general covariance matrix.
  • "diag": each component has its own diagonal covariance matrix.
  • "spherical": each component has its own single variance.

We evaluate the models using the BIC score, keeping the model with the lowest BIC. This process avoids the need for a train-test split by computing the BIC directly on the full dataset, ensuring proper model selection.

Plotting the BIC Scores
We visualize the BIC scores using a bar plot. The model with 2 components and full covariance, which corresponds to the true generative model, is expected to have the lowest BIC and will be selected by the model selection procedure.
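For what it's worth, this step could be sketched with plain matplotlib along the following lines (the grouped-bar layout via a pandas pivot is just one possible presentation, not necessarily what the final example will use; random_state=0 is set only for reproducibility):

```python
# Sketch only: compute BIC for every hyperparameter combination on the full
# dataset, then draw one group of bars per n_components, one bar per
# covariance type.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import ParameterGrid

n_samples = 500
rng = np.random.default_rng(0)
C = np.array([[0.0, -0.1], [1.7, 0.4]])
X = np.concatenate([
    rng.normal(size=(n_samples, 2)) @ C,                        # general
    0.7 * rng.normal(size=(n_samples, 2)) + np.array([-4, 1]),  # spherical
])

param_grid = {
    "n_components": np.arange(1, 7),
    "covariance_type": ["full", "tied", "diag", "spherical"],
}
bic_evaluations = []
for params in ParameterGrid(param_grid):
    bic = GaussianMixture(**params, random_state=0).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic})
df = pd.DataFrame(bic_evaluations)

fig, ax = plt.subplots()
df.pivot(index="n_components", columns="covariance_type", values="BIC").plot.bar(ax=ax)
ax.set_ylabel("BIC score")
ax.set_title("BIC per model (lower is better)")
# Save to disk rather than plt.show() so the sketch also runs headless.
fig.savefig("gmm_bic_scores.png")
```

The lowest bar should correspond to 2 components with "full" covariance, matching the true generative model.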

Model Selection Based on Cross-Validated Log-Likelihood
We can find the best model by maximizing the cross-validated log-likelihood. This is done using GridSearchCV with the default scoring method of GaussianMixture, which computes the log-likelihood for each model.

Plotting the Cross-Validated Log-Likelihoods
We visualize the cross-validated log-likelihood scores in a bar plot. This allows us to verify that the same parameters are selected as when using the BIC method.

Plotting the Best Model
No changes

@ogrisel I have prepared the documentation according to your instructions. Please review it and let me know if any changes are required. If everything looks good, could you please provide the code to be placed in the sections of the page?

@ogrisel
Member Author

ogrisel commented Dec 11, 2024

@Uvi-12 this plan looks good. It's quite in line with the suggestion I made in the previous review: #30329 (comment)

> ensuring proper model selection.

"Proper" is a bit strong. BIC is still some kind of heuristic model selection tool to me. I would rather say, "leveraging BIC's ability to trade data fidelity for model complexity".

@Uvi-12
Contributor

Uvi-12 commented Dec 11, 2024

> @Uvi-12 this plan looks good. It's quite in line with the suggestion I made in the previous review: #30329 (comment)
>
> > ensuring proper model selection.
>
> "Proper" is a bit strong. BIC is still some kind of heuristic model selection tool to me. I would rather say, "leveraging BIC's ability to trade data fidelity for model complexity".

Thank you for the feedback; I will update it as you said. I am relatively new to the code base. Can you please help me with the code to be added to the documentation?
