
add icl to mixture.GMM #4580

Open

eyaler opened this issue Apr 12, 2015 · 6 comments

Comments

@eyaler

eyaler commented Apr 12, 2015

Suggest adding ICL = integrated completed likelihood [1], or its approximation ICL-BIC [2], which is an information criterion useful for determining the number of clusters. Essentially it is: icl = bic + 2 * entropy of the clustering.

code to demonstrate:

import numpy as np

# mixture.GMM.score_samples returns (per-sample log-likelihoods, responsibilities)
_, probs = gmm.score_samples(X)
entropy = -np.nansum(probs * np.log(probs))  # nansum guards against 0 * log(0)
icl = gmm.bic(X) + 2 * entropy

[1] C. Biernacki, G. Celeux and G. Govaert, "Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood", IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), 719–725.
[2] G.J. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, Inc., 2000.
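
With the modern GaussianMixture API (which replaced mixture.GMM), score_samples no longer returns the responsibilities; a rough equivalent of the snippet above, assuming predict_proba is used to obtain them, would be:

import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3).fit(X)
# predict_proba returns the per-sample responsibilities that the old
# GMM.score_samples exposed as its second return value.
probs = gmm.predict_proba(X)
# Clip to avoid log(0) where a responsibility is exactly zero.
entropy = -np.sum(probs * np.log(np.clip(probs, 1e-300, None)))
icl = gmm.bic(X) + 2 * entropy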

@siddharthswarnkar

I would like to work on this.
Is anyone working on this, @eyaler @amueller?

@amueller
Member

amueller commented Oct 8, 2016

@siddharthswarnkar It would be good to see how this compares against the existing AIC and BIC for finding the number of clusters.
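
A minimal sketch of such a comparison (assuming the current GaussianMixture API; the icl helper below is a hypothetical addition, not an existing scikit-learn method):

import numpy as np
from sklearn.mixture import GaussianMixture

def icl(gmm, X):
    # ICL-BIC: BIC plus twice the entropy of the soft cluster assignments.
    probs = gmm.predict_proba(X)
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-300, None)))
    return gmm.bic(X) + 2 * entropy

def best_n_components(X, max_components=10):
    scores = {"aic": [], "bic": [], "icl": []}
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        scores["aic"].append(gmm.aic(X))
        scores["bic"].append(gmm.bic(X))
        scores["icl"].append(icl(gmm, X))
    # All three criteria are "lower is better".
    return {name: int(np.argmin(vals)) + 1 for name, vals in scores.items()}

All three criteria pick the k with the lowest value; ICL's extra entropy term additionally penalizes solutions with ambiguous (overlapping) cluster assignments.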

@amueller
Member

The paper has 1000 citations, but there doesn't seem to be a lot of enthusiasm...

@Quentin62

Quentin62 commented Mar 22, 2022

I was looking for ICL in GaussianMixture. Here is my current implementation:

import numpy as np
from scipy.special import logsumexp
from sklearn.utils.validation import check_is_fitted

def icl(self, X):
    """Integrated Classification Likelihood criterion for the current model on the input X.
    
    You can refer to this :ref:`mathematical section <aic_bic>` for more
    details regarding the formulation of the ICL used.
    
    Parameters
    ----------
    X : array of shape (n_samples, n_dimensions)
        The input samples.
        
    Returns
    -------
    icl : float
        The lower the better.
    """
    check_is_fitted(self)
    X = self._validate_data(X, reset=False)

    # Per-sample, per-component weighted log-probabilities:
    # log(pi_k) + log N(x_i | mu_k, Sigma_k).
    weighted_log_prob = self._estimate_weighted_log_prob(X)
    # Log-responsibilities log(t_ik): normalize over components with logsumexp.
    log_tik = weighted_log_prob - logsumexp(weighted_log_prob, axis=1)[:, None]

    # Entropy of the soft assignments; nansum guards against 0 * log(0) terms.
    entropy = -np.nansum(np.exp(log_tik) * log_tik)

    return self.bic(X) + 2 * entropy
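
Hypothetical usage while this is not merged: the function above could be attached to GaussianMixture and called like bic or aic:

from sklearn.mixture import GaussianMixture

GaussianMixture.icl = icl  # monkey-patch the method defined above
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.icl(X))  # lower is better, same convention as bic/aic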

@ogrisel
Member

ogrisel commented Dec 11, 2024

We need an empirical evaluation across many dataset sizes, dimensions, numbers of clusters, levels of cluster separation/overlap, and covariance structures to measure the fraction of times ICL or BIC recovers the true data-generating process from finite samples.

Note that our current documentation wrongly recommends using BIC for model selection via GridSearchCV (on left-out data), while BIC is designed for in-sample model selection (without CV). This particular problem is tracked in its own dedicated issue: #30323.
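
A sketch of the kind of study described above (the generator, parameter grid, and separation knob here are illustrative assumptions, not an agreed protocol):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

def recovery_rate(true_k=3, n_runs=50, n_samples=500, n_features=2, cluster_std=1.0):
    # Fraction of runs in which each criterion recovers the true n_components.
    hits = {"bic": 0, "icl": 0}
    for seed in range(n_runs):
        X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                          centers=true_k, cluster_std=cluster_std,
                          random_state=seed)
        best = {name: (np.inf, 0) for name in hits}
        for k in range(1, 2 * true_k + 1):
            gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
            probs = gmm.predict_proba(X)
            entropy = -np.sum(probs * np.log(np.clip(probs, 1e-300, None)))
            for name, value in (("bic", gmm.bic(X)),
                                ("icl", gmm.bic(X) + 2 * entropy)):
                if value < best[name][0]:
                    best[name] = (value, k)
        for name, (_, k) in best.items():
            hits[name] += int(k == true_k)
    return {name: h / n_runs for name, h in hits.items()}

Varying cluster_std controls the overlap axis; sweeping n_features, n_samples, and covariance_type would cover the other dimensions mentioned above.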

@ogrisel
Member

ogrisel commented Dec 11, 2024

It would be best to implement this study in a public notebook outside the scikit-learn repo for a start, and based on the results we can decide how to move forward.
