DOC Example on model selection for Gaussian Mixture Models #30323
Comments
@glemaitre could you review my comment and let me know whether the changes should be made? |
@Yuvraj-Pradhan-27: your original comment feels like the output of an LLM paraphrasing the description I wrote in the issue. If that's the case, please refrain from posting such redundant and verbose comments in the future. If that's not the case, then I am sorry ;) I see that you opened two PRs:
I closed the first one because it felt subsumed by #30329. When opening PRs, please refer to the issue they are related to in the description to make it easier to discover them and avoid having several people concurrently working on the same issue. |
I did paraphrase the description you wrote, but the point was only to support your recommendations; I also executed the code you mentioned. I am new to GitHub contributions, and I am really sorry if I did something unwanted. I will take care to avoid it next time. |
No problem. If you are new to contributing to scikit-learn, please make sure to follow the guidelines from: https://scikit-learn.org/dev/developers/index.html And welcome to the project! |
I will go through the document thoroughly, thanks for the help. |
Data Generation
Model Selection without Cross-Validation using the BIC Score
We evaluate the models using the BIC score, keeping the model with the lowest BIC. This process avoids the need for a train-test split by computing the BIC directly on the full dataset, ensuring proper model selection.
Plotting the BIC Scores
Model Selection Based on Cross-Validated Log-Likelihood
Plotting the Cross-Validated Log-Likelihoods
Plotting the Best Model
@ogrisel I have prepared the documentation according to your instructions. Please review it and let me know if any changes are required. If everything looks good, could you please provide the code to be placed in the sections of the page? |
@Uvi-12 this plan looks good. It's quite in line with the suggestion I made in the previous review: #30329 (comment)
"Proper" is a bit strong. BIC is still some kind of heuristic model selection tool to me. I would rather say, "leveraging BIC's ability to trade data fidelity for model complexity". |
Thank you for the feedback, I will update it as you said. I am relatively new to the code base; could you please help me with the code to be added to the documentation? |
Describe the issue linked to the documentation
We have an example that illustrates how to use the BIC score to tune the number of components and the type of covariance matrix parametrization here:
https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_selection.html
However, the BIC score is not meant to be computed in a CV loop, but instead directly on the training set. So we should not use it with a `GridSearchCV` call. Indeed, the BIC score already penalizes the number of parameters depending on the number of data points in the training set.

Instead, we should call `GridSearchCV` on the default `.score` method of the GMM estimator, which computes the log-likelihood and is a perfectly fine metric to select the best model on held-out data in a CV loop.

Note that we can keep computing the BIC score for all the hyperparameter combinations, but we should then do it without a train-test split, for example in a single for loop.
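A minimal sketch of such a loop is given here for illustration. The toy dataset, the variable names (`bic_scores`, `best_covariance_type`, `best_n_components`) and the grid of values are assumptions made for this sketch and are not taken from the issue; only `GaussianMixture`, its `fit` method and its `bic` method are the actual scikit-learn API.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data (an assumption for illustration): one elongated and one
# spherical Gaussian blob, 500 points each.
rng = np.random.RandomState(0)
n_samples = 500
C = np.array([[0.0, -0.1], [1.7, 0.4]])
elongated = rng.randn(n_samples, 2) @ C
spherical = 0.7 * rng.randn(n_samples, 2) + np.array([-4.0, 1.0])
X = np.concatenate([elongated, spherical])

# Fit every combination on the full dataset and record its BIC.
# There is no train-test split: BIC is computed on the data used for fitting.
bic_scores = {}
for covariance_type in ["spherical", "tied", "diag", "full"]:
    for n_components in range(1, 7):
        gmm = GaussianMixture(
            n_components=n_components,
            covariance_type=covariance_type,
            random_state=0,
        ).fit(X)
        bic_scores[(covariance_type, n_components)] = gmm.bic(X)

# Lower BIC is better, so keep the combination with the smallest value.
best_covariance_type, best_n_components = min(bic_scores, key=bic_scores.get)
print(best_covariance_type, best_n_components)
```

Keeping the scores in a dictionary keyed by hyperparameter combination makes it easy to plot them afterwards or to compare the selected model with one chosen by another criterion.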
So in summary I would recommend to:
- update the `GridSearchCV` code to use the `scoring=None` default, which would use the built-in log-likelihood based model evaluation (averaged on the test sets of the CV loop).

We can then check that the two methods select the same model: 2 components with "full" (non-spherical) parametrization of the covariance matrix.
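The cross-validated variant could be sketched as follows. As before, the toy dataset, the shuffling step and the names `param_grid` and `grid_search` are assumptions made for illustration; the sketch relies only on `GridSearchCV` falling back to the estimator's own `score` method when `scoring=None`, which for `GaussianMixture` is the average per-sample log-likelihood.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

# Toy data (an assumption for illustration): one elongated and one
# spherical Gaussian blob, 500 points each.
rng = np.random.RandomState(0)
n_samples = 500
C = np.array([[0.0, -0.1], [1.7, 0.4]])
elongated = rng.randn(n_samples, 2) @ C
spherical = 0.7 * rng.randn(n_samples, 2) + np.array([-4.0, 1.0])
X = np.concatenate([elongated, spherical])

# Shuffle the rows so that every CV fold mixes points from both blobs
# (the default K-fold splitter does not shuffle).
X = X[rng.permutation(len(X))]

param_grid = {
    "n_components": range(1, 7),
    "covariance_type": ["spherical", "tied", "diag", "full"],
}

# scoring=None (the default) makes GridSearchCV call GaussianMixture.score,
# i.e. the average log-likelihood of the held-out samples of each fold.
grid_search = GridSearchCV(
    GaussianMixture(random_state=0),
    param_grid=param_grid,
    scoring=None,
)
grid_search.fit(X)

# Higher mean held-out log-likelihood is better.
print(grid_search.best_params_)
print(grid_search.best_score_)
```

Comparing `grid_search.best_params_` with the combination that minimizes the BIC on the full dataset is one way to check whether the two selection methods agree.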