
Illustrate BIC-based model selection as an efficient alternative to CV for Gaussian mixture models #30329

Closed · wants to merge 6 commits

Conversation


@ghost ghost commented Nov 21, 2024

Added a BIC evaluation loop to compute BIC scores directly on the entire dataset, ensuring proper model selection without train-test splits.

[Screenshot attached: 2024-11-22 at 1:36:00 AM]
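
For readers skimming this PR, here is a minimal, self-contained sketch of the kind of BIC evaluation loop the description refers to. The toy X below is only a stand-in for the example's synthetic dataset, and param_grid mirrors the grid used in the diff further down; neither is taken verbatim from the PR.

import numpy as np
import pandas as pd

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import ParameterGrid

# Stand-in dataset: two well-separated Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.concatenate(
    [rng.normal(loc=-5.0, size=(200, 2)), rng.normal(loc=5.0, size=(200, 2))]
)

param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}

bic_evaluations = []
for params in ParameterGrid(param_grid):
    # Fit each candidate on the full dataset and record its BIC; no
    # train/test split is needed because BIC already penalizes complexity.
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})

# Lower BIC is better, so sort ascending and inspect the best candidates.
bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
print(bic_evaluations.head())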


github-actions bot commented Nov 21, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/examples/mixture/plot_gmm_selection.py	2024-11-25 15:36:42.428442+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/examples/mixture/plot_gmm_selection.py	2024-11-25 15:36:51.413049+00:00
@@ -70,23 +70,22 @@
 
 from sklearn.mixture import GaussianMixture
 from sklearn.model_selection import GridSearchCV
 from sklearn.model_selection import ParameterGrid
 
+
 def gmm_bic_score(estimator, X):
     """Callable to pass to GridSearchCV that will use the BIC score."""
     # Make it negative since GridSearchCV expects a score to maximize
     return -estimator.bic(X)
 
 
 param_grid = {
     "n_components": range(1, 7),
     "covariance_type": ["spherical", "tied", "diag", "full"],
 }
-grid_search = GridSearchCV(
-    GaussianMixture(), param_grid=param_grid
-)
+grid_search = GridSearchCV(GaussianMixture(), param_grid=param_grid)
 grid_search.fit(X)
 
 bic_evaluations = []
 for params in ParameterGrid(param_grid):
     bic_value = GaussianMixture(**params).fit(X).bic(X)
would reformat /home/runner/work/scikit-learn/scikit-learn/examples/mixture/plot_gmm_selection.py

Oh no! 💥 💔 💥
1 file would be reformatted, 923 files would be left unchanged.

ruff

ruff detected issues. Please run ruff check --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


examples/mixture/plot_gmm_selection.py:71:1: I001 [*] Import block is un-sorted or un-formatted
   |
69 |   # `best_estimator_`, respectively.
70 |   
71 | / from sklearn.mixture import GaussianMixture
72 | | from sklearn.model_selection import GridSearchCV
73 | | from sklearn.model_selection import ParameterGrid
74 | | 
75 | | def gmm_bic_score(estimator, X):
   | |_^ I001
76 |       """Callable to pass to GridSearchCV that will use the BIC score."""
77 |       # Make it negative since GridSearchCV expects a score to maximize
   |
   = help: Organize imports

examples/mixture/plot_gmm_selection.py:95:19: F821 Undefined name `pd`
   |
93 |     bic_evaluations.append({**params, "BIC": bic_value})
94 | 
95 | bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
   |                   ^^ F821
96 | print(bic_evaluations.head())
   |

Found 2 errors.
[*] 1 fixable with the `--fix` option.

Generated for commit: 4d38439. Link to the linter CI: here
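
For completeness, one way to address both errors reported above (this is an assumption about the intended fix, not something taken from the PR) is to add the missing pandas import and keep the import block sorted, with two blank lines before the first definition:

# Hypothetical fix sketch for the linter output above:
# F821 is resolved by importing pandas, I001 by sorting/merging the imports.
import pandas as pd

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV, ParameterGrid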

@ghost ghost closed this Nov 21, 2024
@ghost ghost deleted the gmm1 branch November 21, 2024 20:46
@ghost ghost restored the gmm1 branch November 22, 2024 07:10
@ghost (Author) commented Nov 22, 2024

pre-commit.ci autofix

@ghost ghost reopened this Nov 22, 2024
@ogrisel (Member) commented Nov 22, 2024

pre-commit.ci autofix

This comment will not do anything. You have to follow the instructions of #30329 (comment) and push the fix yourself.

examples/mixture/plot_gmm_selection.py (outdated review thread)
@@ -70,7 +70,7 @@

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import ParameterGrid

def gmm_bic_score(estimator, X):
Member commented:
We don't need this function anymore. Let's remove it.

@@ -83,10 +83,18 @@ def gmm_bic_score(estimator, X):
"covariance_type": ["spherical", "tied", "diag", "full"],
}
grid_search = GridSearchCV(
GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score
GaussianMixture(), param_grid=param_grid, scoring=None
)
grid_search.fit(X)
Member commented:
Let's move the code related to running GridSearchCV to a dedicated section because it's not related to BIC score based model selection anymore.

The name of the new section could be "Model selection based on cross-validated log-likelihood".
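
For reference, a rough sketch of what such a dedicated section could contain, assuming X and param_grid are already defined as in the example (with no custom scoring, GridSearchCV falls back to GaussianMixture.score, i.e. the mean per-sample log-likelihood on the held-out folds):

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# No scoring argument: GridSearchCV uses the estimator's own .score method,
# which for GaussianMixture is the average log-likelihood of held-out data.
grid_search = GridSearchCV(GaussianMixture(), param_grid=param_grid)
grid_search.fit(X)
print(grid_search.best_params_)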

Author commented:
I need help on this. What description should be given for "Model selection based on cross-validated log-likelihood", and which block of code should be moved there? Also, should the code block below go into a separate section? If so, please help me with its name, description, and placement in the document.

bic_evaluations = []
for params in ParameterGrid(param_grid):
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})

bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
print(bic_evaluations.head())

Thanks

@ogrisel (Member) commented Nov 25, 2024:
The organization could be:

Data generation

We introduce the code used to generate our synthetic dataset.

Model selection without cross-validation using the BIC score

We present how to select the best n_components / covariance_type parameters using a for loop that computes the BIC scores on the full training set (without the cross-validation implied by GridSearchCV).

Plot the BIC scores

We make a bar plot of the results of the previous for loop.

Model selection based on cross-validated log-likelihood

We show that we can alternatively find the best hyper-parameters by maximizing the cross-validated log-likelihood computed by the default .score method of GaussianMixture, using GridSearchCV without passing a custom scoring argument.

Plot the cross-validated log-likelihoods

Here, we plot the mean_test_score values from grid_search.cv_results_.

We also check that we find the same parameters as with the BIC selection procedure (see the sketch after this comment).

Plot the best model

We can keep the existing code and explanations unchanged for that last section.
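
A sketch of what the "Plot the cross-validated log-likelihoods" step could look like, assuming grid_search was fitted as described above and bic_evaluations is the sorted DataFrame from the BIC loop (the column names follow the cv_results_ convention for this param_grid):

import matplotlib.pyplot as plt
import pandas as pd

# Collect the mean cross-validated log-likelihood of every candidate.
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)

# Simple bar plot of the cross-validated scores per candidate.
labels = [
    f"{ct}, k={k}"
    for ct, k in zip(
        cv_results["param_covariance_type"], cv_results["param_n_components"]
    )
]
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(labels, cv_results["mean_test_score"])
ax.set_ylabel("Mean cross-validated log-likelihood")
ax.tick_params(axis="x", rotation=90)
plt.tight_layout()
plt.show()

# Check that cross-validation agrees with the BIC-based selection.
print("Best params (CV):", grid_search.best_params_)
print("Best params (BIC):", bic_evaluations.iloc[0].to_dict())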

examples/mixture/plot_gmm_selection.py (resolved review thread)
@ogrisel (Member) commented Nov 22, 2024

Also, @Yuvraj-Pradhan-27 please don't forget to update the description of this PR to refer to #30323 to give the context to future reviewers.

@ogrisel ogrisel changed the title Added BIC-based model selection as an efficient alternative to CV Illustrate BIC-based model selection as an efficient alternative to CV for Gaussian mixture models Nov 23, 2024
@ghost ghost closed this by deleting the head repository Nov 27, 2024
This pull request was closed.