add icl to mixture.GMM #4580
@siddharthswarnkar It would be good to see how this compares against the existing AIC and BIC for finding the number of clusters.
The ICL paper has 1000 citations but doesn't seem to have a lot of enthusiasm behind it...
I was looking for ICL in GaussianMixture. Here is my current implementation:

```python
import numpy as np
from scipy.special import logsumexp

from sklearn.utils.validation import check_is_fitted


def icl(self, X):
    """Integrated Classification Likelihood criterion for the current model on the input X.

    You can refer to this :ref:`mathematical section <aic_bic>` for more
    details regarding the formulation of the ICL used.

    Parameters
    ----------
    X : array of shape (n_samples, n_dimensions)
        The input samples.

    Returns
    -------
    icl : float
        The lower the better.
    """
    check_is_fitted(self)
    X = self._validate_data(X, reset=False)
    weighted_log_prob = self._estimate_weighted_log_prob(X)
    # Normalize to obtain the log responsibilities log_tik.
    log_tik = weighted_log_prob - logsumexp(weighted_log_prob, axis=1, keepdims=True)
    # Entropy of the soft cluster assignments.
    entropy = -np.nansum(np.exp(log_tik) * log_tik)
    return self.bic(X) + 2 * entropy
```
We need an empirical evaluation across many dataset sizes, dimensionalities, numbers of clusters, levels of cluster separation/overlap, and covariance structures, measuring the fraction of times ICL or BIC recovers the true data-generating process from finite samples. Note that our current documentation wrongly recommends BIC for model selection.
It would be best to implement this study in a public notebook outside the scikit-learn repo for a start, and based on the results we can decide how to move forward.
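Such a study could start from a minimal Monte Carlo loop like the sketch below. This is only an illustration of the protocol, not a proposed implementation: the `icl_bic` and `selects_true_k` helper names are hypothetical, and the data-generating process, candidate range of `k`, and trial count are placeholders to be varied in the actual study.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def icl_bic(gmm, X):
    """BIC plus twice the entropy of the soft assignments (ICL-BIC)."""
    resp = gmm.predict_proba(X)  # responsibilities, shape (n_samples, n_components)
    log_resp = np.log(np.clip(resp, 1e-300, None))  # clip to avoid log(0)
    return gmm.bic(X) + 2 * (-np.sum(resp * log_resp))

def selects_true_k(criterion, X, true_k, ks=(1, 2, 3, 4, 5)):
    """True when the criterion's argmin over candidate k matches the generator."""
    scores = [criterion(GaussianMixture(n_components=k, random_state=0).fit(X), X)
              for k in ks]
    return ks[int(np.argmin(scores))] == true_k

n_trials, true_k = 10, 3
hits = {"bic": 0, "icl_bic": 0}
for _ in range(n_trials):
    # One draw of the data-generating process: true_k Gaussian clusters in 2-D
    # with random centers, so separation/overlap varies across trials.
    centers = rng.normal(scale=6.0, size=(true_k, 2))
    X = np.vstack([rng.normal(c, 1.0, size=(150, 2)) for c in centers])
    hits["bic"] += selects_true_k(lambda g, data: g.bic(data), X, true_k)
    hits["icl_bic"] += selects_true_k(icl_bic, X, true_k)

# Fraction of trials in which each criterion recovered the true number of clusters.
print({name: h / n_trials for name, h in hits.items()})
```

The real study would sweep the placeholder settings (sample size, dimension, separation scale, covariance type) and tabulate the recovery rates per configuration.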
I suggest adding ICL, the integrated completed likelihood [1], or its approximation ICL-BIC [2], an information criterion useful for determining the number of clusters. Essentially: icl = bic + entropy of the clustering.
code to demonstrate:
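The icl = bic + entropy relation can be sketched against the public GaussianMixture API; the `icl_bic` helper name below is hypothetical, and `predict_proba` is used to obtain the responsibilities rather than any private estimator method:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl_bic(gmm, X):
    """ICL-BIC: BIC plus twice the entropy of the soft cluster assignments."""
    resp = gmm.predict_proba(X)  # responsibilities, shape (n_samples, n_components)
    log_resp = np.log(np.clip(resp, 1e-300, None))  # clip to avoid log(0)
    entropy = -np.sum(resp * log_resp)
    return gmm.bic(X) + 2 * entropy

# Two well-separated 1-D clusters, so both criteria should favor k = 2.
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(-5, 1, size=(200, 1)),
                    rng.normal(5, 1, size=(200, 1))])

for k in (1, 2, 3):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gmm.bic(X), 1), round(icl_bic(gmm, X), 1))
```

Since the entropy term is non-negative, ICL-BIC never falls below BIC; it penalizes solutions with overlapping, poorly separated components more heavily.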
[1] C. Biernacki, G. Celeux, and G. Govaert, "Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood," IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), 719–725.
[2] G.J. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, Inc., 2000.