
add icl to mixture.GMM #4580

Open

eyaler opened this issue Apr 12, 2015 · 6 comments

Comments

@eyaler

eyaler commented Apr 12, 2015

Suggest adding ICL = integrated completed likelihood [1], or its approximation ICL-BIC [2], which is an information criterion useful for determining the number of clusters. Essentially it is: icl = bic + 2 * entropy of the clustering.

code to demonstrate:

import numpy as np

# mixture.GMM.score_samples returns (per-sample log-likelihoods, responsibilities)
_, probs = gmm.score_samples(X)
entropy = -np.nansum(probs * np.log(probs))  # nansum guards against 0 * log(0)
icl = gmm.bic(X) + 2 * entropy

[1] C. Biernacki, G. Celeux and G. Govaert, "Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood", IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), 719–725.
[2] G.J. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, Inc., 2000.
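
With the modern GaussianMixture API (which replaced mixture.GMM), score_samples no longer returns the responsibilities; a rough equivalent of the snippet above, assuming predict_proba is used to obtain them, would be:

import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3).fit(X)
# predict_proba returns the per-sample responsibilities that the old
# GMM.score_samples exposed as its second return value.
probs = gmm.predict_proba(X)
# Clip to avoid log(0) where a responsibility is exactly zero.
entropy = -np.sum(probs * np.log(np.clip(probs, 1e-300, None)))
icl = gmm.bic(X) + 2 * entropy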

@siddharthswarnkar

I would like to work on this.
Is anyone working on this, @eyaler @amueller?

@amueller
Member

amueller commented Oct 8, 2016

@siddharthswarnkar It would be good to see how this compares against the existing AIC and BIC for finding the number of clusters.
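
A minimal sketch of such a comparison (assuming the current GaussianMixture API; the icl helper below is a hypothetical addition, not an existing scikit-learn method):

import numpy as np
from sklearn.mixture import GaussianMixture

def icl(gmm, X):
    # ICL-BIC: BIC plus twice the entropy of the soft cluster assignments.
    probs = gmm.predict_proba(X)
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-300, None)))
    return gmm.bic(X) + 2 * entropy

def best_n_components(X, max_components=10):
    scores = {"aic": [], "bic": [], "icl": []}
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        scores["aic"].append(gmm.aic(X))
        scores["bic"].append(gmm.bic(X))
        scores["icl"].append(icl(gmm, X))
    # All three criteria are "lower is better".
    return {name: int(np.argmin(vals)) + 1 for name, vals in scores.items()}

All three criteria pick the k with the lowest value; ICL's extra entropy term additionally penalizes solutions with ambiguous (overlapping) cluster assignments.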

@amueller
Member

The paper has 1000 citations, but there doesn't seem to be a lot of enthusiasm...

@Quentin62

Quentin62 commented Mar 22, 2022

I was looking for ICL in GaussianMixture. Here is my current implementation:

import numpy as np
from scipy.special import logsumexp
from sklearn.utils.validation import check_is_fitted

def icl(self, X):
    """Integrated Classification Likelihood criterion for the current model on the input X.
    
    You can refer to this :ref:`mathematical section <aic_bic>` for more
    details regarding the formulation of the ICL used.
    
    Parameters
    ----------
    X : array of shape (n_samples, n_dimensions)
        The input samples.
        
    Returns
    -------
    icl : float
        The lower the better.
    """
    check_is_fitted(self)
    X = self._validate_data(X, reset=False)

    # Per-sample, per-component weighted log-probabilities:
    # log(pi_k) + log N(x_i | mu_k, Sigma_k).
    weighted_log_prob = self._estimate_weighted_log_prob(X)
    # Log-responsibilities log(t_ik): normalize over components with logsumexp.
    log_tik = weighted_log_prob - logsumexp(weighted_log_prob, axis=1)[:, None]

    # Entropy of the soft assignments; nansum guards against 0 * log(0) terms.
    entropy = -np.nansum(np.exp(log_tik) * log_tik)

    return self.bic(X) + 2 * entropy
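
Hypothetical usage while this is not merged: the function above could be attached to GaussianMixture and called like bic or aic:

from sklearn.mixture import GaussianMixture

GaussianMixture.icl = icl  # monkey-patch the method defined above
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.icl(X))  # lower is better, same convention as bic/aic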

@ogrisel
Member

ogrisel commented Dec 11, 2024

We need an empirical evaluation across many dataset sizes, dimensions, numbers of clusters, levels of cluster separation/overlap, and covariance structures to measure the fraction of times ICL or BIC recovers the true data-generating process from finite samples.

Note that our current documentation wrongly recommends using BIC for model selection via GridSearchCV (on left-out data), while BIC is designed for in-sample model selection (without CV). This particular problem is tracked in its own dedicated issue: #30323.
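
A sketch of the kind of study described above (the generator, parameter grid, and separation knob here are illustrative assumptions, not an agreed protocol):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

def recovery_rate(true_k=3, n_runs=50, n_samples=500, n_features=2, cluster_std=1.0):
    # Fraction of runs in which each criterion recovers the true n_components.
    hits = {"bic": 0, "icl": 0}
    for seed in range(n_runs):
        X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                          centers=true_k, cluster_std=cluster_std,
                          random_state=seed)
        best = {name: (np.inf, 0) for name in hits}
        for k in range(1, 2 * true_k + 1):
            gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
            probs = gmm.predict_proba(X)
            entropy = -np.sum(probs * np.log(np.clip(probs, 1e-300, None)))
            for name, value in (("bic", gmm.bic(X)),
                                ("icl", gmm.bic(X) + 2 * entropy)):
                if value < best[name][0]:
                    best[name] = (value, k)
        for name, (_, k) in best.items():
            hits[name] += int(k == true_k)
    return {name: h / n_runs for name, h in hits.items()}

Varying cluster_std controls the overlap axis; sweeping n_features, n_samples, and covariance_type would cover the other dimensions mentioned above.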

@ogrisel
Member

ogrisel commented Dec 11, 2024

It would be best to implement this study in a public notebook outside the scikit-learn repo for a start, and based on the results we can decide how to move forward.
