Illustrate BIC-based model selection as an efficient alternative to CV for Gaussian mixture models #30329
Conversation
❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. You can see the details of the linting issues under the lint job of this PR.

pre-commit.ci autofix

This comment will not do anything. You have to follow the instructions of #30329 (comment) and push the fix yourself.
```diff
@@ -70,7 +70,7 @@
 from sklearn.mixture import GaussianMixture
-from sklearn.model_selection import GridSearchCV
+from sklearn.model_selection import ParameterGrid


 def gmm_bic_score(estimator, X):
```
We don't need this function anymore. Let's remove it.
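(For context, the scorer being removed looks roughly like the sketch below: it returns the negated BIC, since `GridSearchCV` maximizes its scoring function while a lower BIC is better. The exact docstring wording is an assumption.)

```python
def gmm_bic_score(estimator, X):
    """Callable to pass to GridSearchCV that ranks models by BIC."""
    # GridSearchCV maximizes scores, so negate the BIC (lower is better).
    return -estimator.bic(X)
```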
```diff
@@ -83,10 +83,18 @@ def gmm_bic_score(estimator, X):
     "covariance_type": ["spherical", "tied", "diag", "full"],
 }
 grid_search = GridSearchCV(
-    GaussianMixture(), param_grid=param_grid, scoring=gmm_bic_score
+    GaussianMixture(), param_grid=param_grid, scoring=None
 )
 grid_search.fit(X)
```
Let's move the code related to running `GridSearchCV` to a dedicated section, because it's not related to BIC-score-based model selection anymore. The name of the new section could be "Model selection based on cross-validated log-likelihood".
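A minimal sketch of what that dedicated section could contain. The stand-in data and the `n_components` range are assumptions made only to keep the snippet self-contained; the real example would reuse its own `X` and `param_grid`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Stand-in for the example's synthetic dataset.
rng = np.random.RandomState(0)
X = rng.randn(500, 2)

param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}

# Leaving scoring unset makes GridSearchCV fall back to
# GaussianMixture.score, i.e. the mean per-sample log-likelihood,
# so the search maximizes the cross-validated log-likelihood.
grid_search = GridSearchCV(GaussianMixture(), param_grid=param_grid)
grid_search.fit(X)
print(grid_search.best_params_)
```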
I need help on this: what description should be given for "Model selection based on cross-validated log-likelihood", and which block of code should be moved there? Also, does the code block below need to go in a separate section? If so, please help me with its name, description, and placement in the document.
```python
import pandas as pd
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import ParameterGrid

# Fit one model per hyper-parameter combination; lower BIC is better.
bic_evaluations = []
for params in ParameterGrid(param_grid):
    bic_value = GaussianMixture(**params).fit(X).bic(X)
    bic_evaluations.append({**params, "BIC": bic_value})
bic_evaluations = pd.DataFrame(bic_evaluations).sort_values("BIC", ascending=True)
print(bic_evaluations.head())
```
Thanks
The organization could be:

1. **Data generation**: we introduce the code used to generate our synthetic dataset.
2. **Model selection without cross-validation using the BIC score**: we present how to select the best `n_components` / `covariance_type` parameters using a for loop that computes the BIC scores on the full training set (without the cross-validation implied by `GridSearchCV`).
3. **Plot the BIC scores**: we make a bar plot of the results of the previous for loop.
4. **Model selection based on cross-validated log-likelihood**: we show that we can alternatively find the best hyper-parameters by maximizing the cross-validated log-likelihood computed by the default `.score` method of `GaussianMixture`, using `GridSearchCV` without passing a custom `scoring` argument.
5. **Plot the cross-validated log-likelihoods**: here we plot the `mean_test_score` values from `grid_search.cv_results_` (see the sketch after this list), and we also check that we find the same parameters as with the BIC selection procedure.
6. **Plot the best model**: we can keep the existing code and explanations unchanged for that last section.
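For the "Plot the cross-validated log-likelihoods" step, a rough sketch, assuming a fitted `grid_search` as in the section above. The column names are the standard `GridSearchCV.cv_results_` keys; the bar-plot layout is only a suggestion:

```python
import matplotlib.pyplot as plt
import pandas as pd

# One row per candidate: hyper-parameters plus mean CV log-likelihood.
cv_df = pd.DataFrame(grid_search.cv_results_)[
    ["param_n_components", "param_covariance_type", "mean_test_score"]
]

# Bar plot: one group per n_components, one bar per covariance_type.
cv_df.pivot(
    index="param_n_components",
    columns="param_covariance_type",
    values="mean_test_score",
).plot.bar()
plt.ylabel("Mean cross-validated log-likelihood")
plt.xlabel("Number of mixture components")
plt.show()

# Check that CV selects the same parameters as the BIC procedure.
print(grid_search.best_params_)
```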
Also, @Yuvraj-Pradhan-27, please don't forget to update the description of this PR to refer to #30323 to give the context to future reviewers.
Co-authored-by: Olivier Grisel <[email protected]>
Added a BIC evaluation loop to compute BIC scores directly on the entire dataset, ensuring proper model selection without train-test splits.