
Illustrate BIC-based model selection as an efficient alternative to CV for Gaussian mixture models #30329

Closed · wants to merge 6 commits

Conversation


@ghost ghost commented Nov 21, 2024

Added a BIC evaluation loop to compute BIC scores directly on the entire dataset, ensuring proper model selection without train-test splits.

[Screenshot attached: 2024-11-22 at 1:36:00 AM]
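
For readers skimming this PR, here is a minimal, self-contained sketch of the kind of BIC evaluation loop the description refers to. The toy X below is only a stand-in for the example's synthetic dataset, and param_grid mirrors the grid used in the diff further down; neither is taken verbatim from the PR.

import numpy as np
import pandas as pd

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import ParameterGrid

# Stand-in dataset: two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.concatenate(
    [rng.normal(loc=-5.0, size=(200, 2)), rng.normal(loc=5.0, size=(200, 2))]
)

param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}

bic_evaluations = []
for params in ParameterGrid(param_grid):
    # Fit each candidate on the full dataset and record its BIC; no
    # train/test split is needed because BIC already penalizes complexity.
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})

# Lower BIC is better, so sort ascending and inspect the best candidates.
bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
print(bic_evaluations.head())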


github-actions bot commented Nov 21, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/examples/mixture/plot_gmm_selection.py	2024-11-25 15:36:42.428442+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/examples/mixture/plot_gmm_selection.py	2024-11-25 15:36:51.413049+00:00
@@ -70,23 +70,22 @@
 
 from sklearn.mixture import GaussianMixture
 from sklearn.model_selection import GridSearchCV
 from sklearn.model_selection import ParameterGrid
 
+
 def gmm_bic_score(estimator, X):
     """Callable to pass to GridSearchCV that will use the BIC score."""
     # Make it negative since GridSearchCV expects a score to maximize
     return -estimator.bic(X)
 
 
 param_grid = {
     "n_components": range(1, 7),
     "covariance_type": ["spherical", "tied", "diag", "full"],
 }
-grid_search = GridSearchCV(
-    GaussianMixture(), param_grid=param_grid
-)
+grid_search = GridSearchCV(GaussianMixture(), param_grid=param_grid)
 grid_search.fit(X)
 
 bic_evaluations = []
 for params in ParameterGrid(param_grid):
     bic_value = GaussianMixture(**params).fit(X).bic(X)
would reformat /home/runner/work/scikit-learn/scikit-learn/examples/mixture/plot_gmm_selection.py

Oh no! 💥 💔 💥
1 file would be reformatted, 923 files would be left unchanged.

ruff

ruff detected issues. Please run ruff check --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


examples/mixture/plot_gmm_selection.py:71:1: I001 [*] Import block is un-sorted or un-formatted
   |
69 |   # `best_estimator_`, respectively.
70 |   
71 | / from sklearn.mixture import GaussianMixture
72 | | from sklearn.model_selection import GridSearchCV
73 | | from sklearn.model_selection import ParameterGrid
74 | | 
75 | | def gmm_bic_score(estimator, X):
   | |_^ I001
76 |       """Callable to pass to GridSearchCV that will use the BIC score."""
77 |       # Make it negative since GridSearchCV expects a score to maximize
   |
   = help: Organize imports

examples/mixture/plot_gmm_selection.py:95:19: F821 Undefined name `pd`
   |
93 |     bic_evaluations.append({**params, "BIC": bic_value})
94 | 
95 | bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
   |                   ^^ F821
96 | print(bic_evaluations.head())
   |

Found 2 errors.
[*] 1 fixable with the `--fix` option.

Generated for commit: 4d38439. Link to the linter CI: here
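
For completeness, one way to address both errors reported above (this is an assumption about the intended fix, not something taken from the PR) is to add the missing pandas import and keep the import block sorted, with two blank lines before the first definition:

# Hypothetical fix sketch for the linter output above:
# F821 is resolved by importing pandas, I001 by sorting/merging the imports.
import pandas as pd

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, ParameterGrid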

@ghost ghost closed this Nov 21, 2024
@ghost ghost deleted the gmm1 branch November 21, 2024 20:46
@ghost ghost restored the gmm1 branch November 22, 2024 07:10
@ghost (Author) commented Nov 22, 2024

pre-commit.ci autofix

@ghost ghost reopened this Nov 22, 2024
@ogrisel (Member) commented Nov 22, 2024

pre-commit.ci autofix

This comment will not do anything. You have to follow the instructions of #30329 (comment) and push the fix yourself.

examples/mixture/plot_gmm_selection.py (outdated review thread)
@@ -70,7 +70,7 @@

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import ParameterGrid

def gmm_bic_score(estimator, X):
Member commented:
We don't need this function anymore. Let's remove it.

@@ -83,10 +83,18 @@ def gmm_bic_score(estimator, X):
"covariance_type": ["spherical", "tied", "diag", "full"],
}
grid_search = GridSearchCV(
GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score
GaussianMixture(), param_grid=param_grid, scoring=None
)
grid_search.fit(X)
Member commented:
Let's move the code related to running GridSearchCV to a dedicated section because it's not related to BIC score based model selection anymore.

The name of the new section could be "Model selection based on cross-validated log-likelihood".
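
For reference, a rough sketch of what such a dedicated section could contain, assuming X and param_grid are already defined as in the example (with no custom scoring, GridSearchCV falls back to GaussianMixture.score, i.e. the mean per-sample log-likelihood on the held-out folds):

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# No scoring argument: GridSearchCV uses the estimator's own .score method,
# which for GaussianMixture is the average log-likelihood of held-out data.
grid_search = GridSearchCV(GaussianMixture(), param_grid=param_grid)
grid_search.fit(X)
print(grid_search.best_params_)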

Author commented:
I need help on this. What description should be given for "Model selection based on cross-validated log-likelihood", and which block of code should be moved there? Also, should the code block below go into a separate section? If so, please help me with its name, description, and placement in the document.

bic_evaluations = []
for params in ParameterGrid(param_grid):
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})

bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
print(bic_evaluations.head())

Thanks

@ogrisel (Member) commented Nov 25, 2024:
The organization could be:

Data generation

We introduce the code used to generate our synthetic dataset.

Model selection without cross-validation using the BIC score

We present how to select the best n_components / covariance_type parameters using a for loop that computes the BIC scores on the full training set (without the cross-validation implied by GridSearchCV).

Plot the BIC scores

We make a bar plot of the results of the previous for loop.

Model selection based on cross-validated log-likelihood

We show that we can alternatively find the best hyper-parameters by maximizing the cross-validated log-likelihood computed by the default .score method of GaussianMixture, using GridSearchCV without passing a custom scoring argument.

Plot the cross-validated log-likelihoods

Here, we plot the mean_test_score values from grid_search.cv_results_.

We also check that we find the same parameters as with the BIC selection procedure (see the sketch after this comment).

Plot the best model

We can keep the existing code and explanations unchanged for that last section.
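
A sketch of what the "Plot the cross-validated log-likelihoods" step could look like, assuming grid_search was fitted as described above and bic_evaluations is the sorted DataFrame from the BIC loop (the column names follow the cv_results_ convention for this param_grid):

import matplotlib.pyplot as plt
import pandas as pd

# Collect the mean cross-validated log-likelihood of every candidate.
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)

# Simple bar plot of the cross-validated scores per candidate.
labels = [
    f"{ct}, k={k}"
    for ct, k in zip(
        cv_results["param_covariance_type"], cv_results["param_n_components"]
    )
]
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(labels, cv_results["mean_test_score"])
ax.set_ylabel("Mean cross-validated log-likelihood")
ax.tick_params(axis="x", rotation=90)
plt.tight_layout()
plt.show()

# Check that cross-validation agrees with the BIC-based selection.
print("Best params (CV):", grid_search.best_params_)
print("Best params (BIC):", bic_evaluations.iloc[0].to_dict())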

examples/mixture/plot_gmm_selection.py (resolved review thread)
@ogrisel (Member) commented Nov 22, 2024

Also, @Yuvraj-Pradhan-27 please don't forget to update the description of this PR to refer to #30323 to give the context to future reviewers.

@ogrisel ogrisel changed the title Added BIC-based model selection as an efficient alternative to CV Illustrate BIC-based model selection as an efficient alternative to CV for Gaussian mixture models Nov 23, 2024
@ghost ghost closed this by deleting the head repository Nov 27, 2024
This pull request was closed.