Python implementation of Labeled LDA (Ramage+ EMNLP2009)

Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP2009) is a supervised topic model derived from LDA (Blei+ 2003).
While LDA often estimates topics that do not match human expectations because it is unsupervised, Labeled LDA handles documents annotated with multiple labels and ties its topics to those labels.

I implemented Labeled LDA in Python. The code is here:

llda.py performs LLDA estimation on arbitrary text. Its input file contains one document per line.
Each document can be given one or more labels by putting them in brackets at the head of the line, like the following:

[label1,label2,...]some text

If some documents are given without labels, the model works in a semi-supervised manner.
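For illustration, a line in this format could be parsed with something like the following sketch (the exact parsing in llda.py may differ):

import re

def parse_line(line):
    # split "[label1,label2,...]some text" into (labels, words);
    # a line without brackets is treated as an unlabeled document
    m = re.match(r"\[(.+?)\](.*)", line)
    if m:
        labels = [x.strip() for x in m.group(1).split(",")]
        text = m.group(2)
    else:
        labels, text = [], line
    return labels, text.lower().split()

print(parse_line("[grain,wheat]the export of wheat increased"))
# (['grain', 'wheat'], ['the', 'export', 'of', 'wheat', 'increased'])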

llda_nltk.py is a sample that uses llda.py. It estimates an LLDA model on 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.

(Ramage+ EMNLP2009) does not give LLDA's perplexity, so I derived the document-topic distributions and the topic-word ones that it requires:

\theta_{dk} = \frac{n_{dk} + \alpha}{n_d + M_d \alpha} \;\; (k \in \lambda(d)), \qquad \theta_{dk} = 0 \;\; (k \notin \lambda(d)),

\phi_{kw} = \frac{n_{kw} + \beta}{n_k + V\beta},

where λ(d) means the topic set corresponding to the labels of document d, and M_d is the size of that set.
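As a concrete sketch of these distributions (the variable names n_d_k, n_k_w, n_k and labels are my own for illustration, not the ones used in llda.py):

import numpy

def llda_distributions(doc_lens, labels, n_d_k, n_k_w, n_k, alpha, beta):
    # doc_lens : list of document lengths (n_d)
    # labels   : labels[d] = list of topic ids in lambda(d)
    # n_d_k    : (D, K) topic counts per document
    # n_k_w    : (K, V) word counts per topic
    # n_k      : (K,)   total word counts per topic
    V = n_k_w.shape[1]
    phi = (n_k_w + beta) / (n_k[:, None] + V * beta)   # topic-word distributions
    thetas = []
    for d, ks in enumerate(labels):                    # ks = lambda(d), M_d = len(ks)
        theta = numpy.zeros(n_d_k.shape[1])            # zero outside lambda(d)
        theta[ks] = (n_d_k[d, ks] + alpha) / (doc_lens[d] + len(ks) * alpha)
        thetas.append(theta)
    return thetas, phi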

I would be glad if you pointed out any mistakes in these.

LLDA requires that labels be assigned to topics explicitly and exactly, but it is very difficult to know how many topics should be assigned to each label for better estimation.
Moreover, it is natural for some categories to share common topics (e.g. the "baseball" category and the "soccer" category).

DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA to estimate the label-topic correspondence:

https://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/

I want to implement it too, but it is very complex…

HDP-LDA updates

The Hierarchical Dirichlet Process (Teh+ 2006) is a nonparametric Bayesian model which can handle an infinite number of topics.
In particular, HDP-LDA is interesting as an extension of LDA.

(Teh+ 2006) introduced the collapsed Gibbs sampling updates for the general HDP framework, but not for HDP-LDA specifically.
To obtain the updates for HDP-LDA, it is necessary to plug HDP-LDA's base measure H and emission distribution F(φ) into the equation below:

f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji} \mid \phi_k) \prod_{j'i' \neq ji,\, z_{j'i'}=k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i' \neq ji,\, z_{j'i'}=k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k} , \qquad (eq. 30 in [Teh+ 2006])

where h is the probability density function of H and f is that of F.
In the case of HDP-LDA, H is a symmetric Dirichlet distribution over the vocabulary and F is a topic-word multinomial distribution, that is

h(\phi) = \frac{\Gamma(V\beta)}{\Gamma(\beta)^V} \prod_{w=1}^{V} \phi_w^{\beta-1} ,
f(x = w \mid \phi_k) = \phi_{kw} ,

where V is the vocabulary size and \beta is the parameter of the Dirichlet prior.

Substituting these into equation (30), we obtain

f_k^{-x_{ji}}(x_{ji} = w) = \frac{n_{kw}^{-ji} + \beta}{n_k^{-ji} + V\beta} ,

where n_{kw}^{-ji} is the count of word w assigned to topic k excluding x_{ji}, and n_k^{-ji} = \sum_w n_{kw}^{-ji}.

We also need f_{k^{new}} for the case where t takes a new table and a new topic is drawn for it. It is obtained as follows:

f_{k^{new}}(x_{ji}) = \int f(x_{ji} \mid \phi)\, h(\phi)\, d\phi = \frac{1}{V} .


It is also necessary to write down f_k(x_{jt}), the joint likelihood of all the words x_{jt} at table t in document j, for sampling k.

For n_{kw}^{-x_{jt}} (it means the "term count of word w with topic k", excluding the words x_{jt}) and n_k^{-x_{jt}} = \sum_w n_{kw}^{-x_{jt}},

f_k(x_{jt}) = \frac{\Gamma(n_k^{-x_{jt}} + V\beta)}{\Gamma(n_k^{-x_{jt}} + n_{jt\cdot} + V\beta)} \prod_w \frac{\Gamma(n_{kw}^{-x_{jt}} + n_{jtw} + \beta)}{\Gamma(n_{kw}^{-x_{jt}} + \beta)} ,

where n_{jtw} is the count of word w at table t in document j and n_{jt\cdot} = \sum_w n_{jtw}.
When implementing this in Python, it is faster to keep the Gamma functions as they are rather than unfolding them into products. In either case it is necessary to compute them as logarithms, or f_k(x_jt) will overflow the floating-point range.
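For illustration, here is a minimal sketch of computing log f_k(x_jt) with log-Gamma functions (scipy.special.gammaln); the variable names are my own, not those of any particular implementation:

import numpy
from scipy.special import gammaln

def log_f_k_x_jt(n_k_w, n_k, n_jt_w, V, beta):
    # n_k_w  : (K, V) word counts per topic, excluding the words of table t in document j
    # n_k    : (K,)   total word counts per topic, excluding that table
    # n_jt_w : (V,)   word counts of table t in document j
    n_jt = n_jt_w.sum()
    log_f = gammaln(n_k + V * beta) - gammaln(n_k + n_jt + V * beta)
    log_f += (gammaln(n_k_w + n_jt_w + beta) - gammaln(n_k_w + beta)).sum(axis=1)
    return log_f  # log f_k(x_jt) for every existing topic k

When actually sampling k, these log values can be exponentiated after subtracting their maximum so that the resulting weights stay within the floating-point range.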

Finally, substituting these into the conditional distributions for sampling t and k in [Teh+ 2006] gives the complete HDP-LDA updates.

[Kim+ ICML12] Dirichlet Process with Mixed Random Measures

We held a private reading meeting for ICML 2012.
I took charge of [Kim+ ICML12] "Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data" and introduced it.
This is my presentation for it.

DP-MRM [Kim+ ICML12] is a supervised topic model like sLDA [Blei+ 2007], DiscLDA [Lacoste-Julien+ 2008] and MedLDA [Zhu+ 2009], and is regarded as a nonparametric version of Labeled LDA [Ramage+ 2009] in particular.

Although Labeled LDA is easy to implement (my implementation is here), it has the disadvantage that you must specify the label-topic correspondence explicitly and manually.
On the other hand, DP-MRM can decide the label-topic correspondence automatically as distributions, which I find very interesting.
But it is hard to implement because it is a nonparametric Bayesian model.
Since what I want is not infinite topics but a hierarchical label-topic correspondence, I guess it would become very useful, handy and fast to replace the DPs in this model with ordinary Dirichlet distributions… I am going to try it! 😀

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (3)

In the previous article, I introduced a simple implementation of collapsed Gibbs sampling estimation for Latent Dirichlet Allocation (LDA).
However, since each word topic z_mn is initialized to a random topic in that implementation, there are some troubles.

First, it needs many iterations before the perplexity begins to decrease.
Second, when the perplexity converges, almost all topics give high probabilities to stopwords like 'the' and 'of'.
Moreover, there are groups of topics which share similar word distributions, so the number of substantially different topics is smaller than the topic size parameter K.

Instead of random initialization, we draw each initial topic from the same posterior form used when sampling a new topic, incrementally as words are added.

Furthermore, since n_zt is always used with beta added, n_mz with alpha, and n_z with V * beta, it is an efficient implementation to add the corresponding parameters to these variables beforehand.

Then, a sample initialization code is the following,

# docs : documents, each of which is an array of word ids
# K : number of topics
# V : vocabulary size
# alpha, beta : Dirichlet hyperparameters
import numpy

z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K)) + alpha     # word count of each document and topic (alpha added in advance)
n_z_t = numpy.zeros((K, V)) + beta # word count of each topic and vocabulary (beta added in advance)
n_z = numpy.zeros(K) + V * beta    # word count of each topic (V * beta added in advance)

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        # draw from the posterior
        p_z = n_z_t[:, t] * n_m_z[m] / n_z
        z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))

and a sample inference code is the following.

for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic
        p_z = n_z_t[:, t] * n_m_z[m] / n_z # only this line differs from the previous version
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1

(The inference code is almost the same as the previous version; only the posterior calculation is simplified.)

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (2)

Before iterations of LDA estimation, it is necessary to initialize parameters.
Collapsed Gibbs Sampling (CGS) estimation has the following parameters.

  • z_mn : topic of word n of document m
  • n_mz : word count of document m with topic z
  • n_tz : count of word t with topic z
  • n_z : word count with topic z

The simplest initialization is to assign each word to a random topic and increment the corresponding counters n_mz, n_tz and n_z.

# docs : documents, each of which is an array of word ids
# K : number of topics
# V : vocabulary size
import numpy

z_m_n = [] # topics of words of documents
n_m_z = numpy.zeros((len(docs), K))     # word count of each document and topic
n_z_t = numpy.zeros((K, V)) # word count of each topic and vocabulary
n_z = numpy.zeros(K)        # word count of each topic

for m, doc in enumerate(docs):
    z_n = []
    for t in doc:
        z = numpy.random.randint(0, K)
        z_n.append(z)
        n_m_z[m, z] += 1
        n_z_t[z, t] += 1
        n_z[z] += 1
    z_m_n.append(numpy.array(z_n))

Then the iterative inference using the full conditional from the previous article is the following. It is repeated until the perplexity becomes stable.

for m, doc in enumerate(docs):
    for n, t in enumerate(doc):
        # discount for n-th word t with topic z
        z = z_m_n[m][n]
        n_m_z[m, z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1

        # sampling new topic
        p_z = (n_z_t[:, t] + beta) * (n_m_z[m] + alpha) / (n_z + V * beta)
        new_z = numpy.random.multinomial(1, p_z / p_z.sum()).argmax()

        # preserve the new topic and increase the counters
        z_m_n[m][n] = new_z
        n_m_z[m, new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1

Here numpy.random.multinomial() is used with the number of experiments set to 1 to draw from the multinomial (categorical) distribution.
Since this method returns an array in which a certain k-th element is 1 and the rest are 0, argmax() retrieves that k. This is a little wasteful…
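One possible workaround (my own sketch, not part of the code above) is to draw the index directly from the cumulative sums with a single uniform random number:

import numpy

def draw_index(p_z):
    # draw an index proportionally to the (unnormalized) weights p_z,
    # without building the 0/1 indicator array that multinomial(1, ...) returns
    return numpy.searchsorted(p_z.cumsum(), numpy.random.rand() * p_z.sum())

# usage in the sampling loop: new_z = draw_index(p_z)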

The next article will show another efficient initialization.

Collapsed Gibbs Sampling Estimation for Latent Dirichlet Allocation (1)

Latent Dirichlet Allocation (LDA) is a generative model which is used as a topic model for language and other data.

Graphical model of LDA (figure from Wikipedia)

Each random variable means the following:

  • θ : document-topic distribution, a multinomial over topics drawn from a Dirichlet distribution
  • φ : topic-word distribution, a multinomial over words drawn from a Dirichlet distribution
  • Z : topic of a word, drawn from the document's topic multinomial θ
  • W : word, drawn from the word multinomial φ of its topic

There are some popular estimation methods for LDA, and collapsed Gibbs sampling (CGS) is one of them.
This method integrates out all random variables except the word topics {z_mn} and draws each z_mn from its posterior.
The posterior of z_mn is the following:

p(z_{mn} = z \mid \mathbf{z}^{-mn}, \mathbf{w}) \propto \frac{(n_{tz}^{-mn} + \beta)(n_{mz}^{-mn} + \alpha)}{n_z^{-mn} + V\beta} , \qquad t = w_{mn},

where n_mz is the word count of document m with topic z, n_tz is the count of word t with topic z, n_z is the word count with topic z, and the superscript -mn means "excluding z_mn."
The estimation iterates until the perplexity converges, or for an appropriate number of iterations.

\mathrm{perplexity}(\mathbf{w}) = \exp\left( -\frac{1}{N} \sum_m \sum_n \log \sum_z \phi_{z, w_{mn}} \theta_{mz} \right) ,

where

\phi_{zt} = \frac{n_{tz} + \beta}{n_z + V\beta} \quad \text{(topic-word distributions)} ,

\theta_{mz} = \frac{n_{mz} + \alpha}{n_m + K\alpha} \quad \text{(document-topic distributions)} ,

N = \sum_m n_m, and n_m is the word count of document m.
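For illustration, a perplexity computation following these formulas could look like the sketch below (it assumes the raw counters n_m_z, n_z_t, n_z without the hyperparameters added beforehand, and is not taken verbatim from my lda.py):

import numpy

def perplexity(docs, n_m_z, n_z_t, n_z, alpha, beta):
    K, V = n_z_t.shape
    phi = (n_z_t + beta) / (n_z[:, None] + V * beta)         # topic-word distributions
    log_likelihood, N = 0.0, 0
    for m, doc in enumerate(docs):
        theta = (n_m_z[m] + alpha) / (len(doc) + K * alpha)  # document-topic distribution
        for t in doc:
            log_likelihood += numpy.log(numpy.inner(phi[:, t], theta))
        N += len(doc)
    return numpy.exp(-log_likelihood / N)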
Although perplexity usually decreases as learning progresses, my experiments showed some different tendencies.

Continued on the next post.

Latent Dirichlet Allocation in Python

Latent Dirichlet Allocation (LDA) is a language topic model.

In LDA, each document has a topic distribution and each topic has a word distribution.
Each word of a document is generated by drawing a topic from the document's topic distribution and then drawing the word from that topic's word distribution.
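As a rough sketch of this generative process (the function and its parameters are made up for illustration and are not part of the implementation described below):

import numpy

def generate_corpus(D, K, V, doc_len, alpha, beta):
    # toy LDA generative process: returns D documents as lists of word ids
    phi = numpy.random.dirichlet([beta] * V, K)        # topic-word distributions
    docs = []
    for _ in range(D):
        theta = numpy.random.dirichlet([alpha] * K)    # document-topic distribution
        doc = []
        for _ in range(doc_len):
            z = numpy.random.multinomial(1, theta).argmax()           # draw a topic
            doc.append(numpy.random.multinomial(1, phi[z]).argmax())  # draw a word from that topic
        docs.append(doc)
    return docs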

Although LDA was originally estimated with variational Bayes (Blei+ 2003), the collapsed Gibbs sampling (CGS) method is known to give a more precise estimation.
So I tried implementing the CGS estimation of LDA in Python.

It requires Python 2.6, numpy and NLTK.


$ ./lda.py -h
Usage: lda.py [options]

Options:
-h, --help show this help message and exit
-f FILENAME corpus filename
-c CORPUS using range of Brown corpus' files(start:end)
--alpha=ALPHA parameter alpha
--beta=BETA parameter beta
-k K number of topics
-i ITERATION iteration count
-s smart initialize of parameters
--stopwords exclude stop words
--seed=SEED random seed
--df=DF threshold of document frequency to cut words

$ ./lda.py -c 0:20 -k 10 --alpha=0.5 --beta=0.5 -i 50

This command outputs the perplexity at every iteration and the estimated topic-word distributions (the top 20 words by probability for each topic).
I will explain the main point of this implementation in the next article.