Labeled LDA (D. Ramage, D. Hall, R. Nallapati and C.D. Manning; EMNLP2009) is a supervised topic model derived from LDA (Blei+ 2003).
While LDA is unsupervised, so its estimated topics often do not match human expectations, Labeled LDA is supervised and handles documents with multiple labels.
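The core of LLDA is that each document's topic candidates are restricted to the topics of its own labels; otherwise the collapsed Gibbs sampling step is the same as in plain LDA. Here is a minimal sketch of that step (placeholder names for the count arrays; not the actual implementation):

```python
import numpy as np

def sample_topic(n_d_k, n_k_w, n_k, doc_labels, w, alpha, beta, V):
    """Collapsed Gibbs step for word w, restricted to the document's labels.

    n_d_k[k]    : count of words in this document assigned to topic k
    n_k_w[k, w] : count of word w assigned to topic k over the corpus
    n_k[k]      : total count of words assigned to topic k
    doc_labels  : list of topic ids allowed for this document (= its labels)
    """
    # unlike plain LDA, only the label topics get non-zero probability
    p = np.array([(n_d_k[k] + alpha) * (n_k_w[k, w] + beta) / (n_k[k] + V * beta)
                  for k in doc_labels])
    return doc_labels[np.random.multinomial(1, p / p.sum()).argmax()]
```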
I implemented Labeled LDA in python. The code is here:
- https://github.com/shuyo/iir/blob/master/lda/llda.py
- https://github.com/shuyo/iir/blob/master/lda/llda_nltk.py
llda.py is an LLDA estimator for arbitrary text. Its input file contains one document per line.
Each document can be given labels by putting them in brackets at the head of the line, like the following:
[label1,label2,...]some text
Documents can also be given without labels; in that case the model works in a semi-supervised fashion.
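As an illustration, a line in this format could be parsed like the following sketch (hypothetical code, not taken from llda.py):

```python
import re

def parse_line(line):
    # "[label1,label2,...]some text" -> (labels, words)
    m = re.match(r'\[(.+?)\](.*)', line)
    if m:
        labels, text = m.group(1).split(','), m.group(2)
    else:
        labels, text = [], line  # no labels -> semi-supervised case
    return labels, text.split()

print(parse_line("[grain,ship]the cargo of wheat arrived"))
# (['grain', 'ship'], ['the', 'cargo', 'of', 'wheat', 'arrived'])
```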
llda_nltk.py is a sample script that uses llda.py. It estimates an LLDA model on 100 documents sampled from the Reuters corpus in NLTK and outputs the top 20 words for each label.
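For reference, the input for such a sample could be assembled from NLTK's Reuters corpus roughly like this (a sketch; the actual script may sample and preprocess differently):

```python
import random
import nltk
from nltk.corpus import reuters

nltk.download('reuters', quiet=True)
fileids = random.sample(reuters.fileids(), 100)      # 100 sampled documents
docs    = [[w.lower() for w in reuters.words(f)] for f in fileids]
labels  = [reuters.categories(f) for f in fileids]   # multiple labels per doc
```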
(Ramage+ EMNLP2009) does not report LLDA's perplexity, so I derived the document-topic and topic-word distributions that it requires:

$$\hat\theta_{d,k} = \begin{cases}\dfrac{n_{d,k} + \alpha}{\sum_{k' \in \lambda(d)} n_{d,k'} + M_d \alpha} & (k \in \lambda(d))\\[4pt] 0 & (\text{otherwise})\end{cases}$$

$$\hat\varphi_{k,w} = \frac{n_{k,w} + \beta}{\sum_{w'} n_{k,w'} + V \beta}$$

where λ(d) is the set of topics corresponding to the labels of document d, M_d is the size of that set, n_{d,k} is the number of words in document d assigned to topic k, n_{k,w} is the number of times word w is assigned to topic k, and V is the vocabulary size.
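With these two distributions, perplexity can be computed in the usual way. A minimal sketch, assuming theta (D×K) and phi (K×V) hold the estimates above:

```python
import numpy as np

def perplexity(docs, theta, phi):
    """docs: list of word-id lists, theta: D x K, phi: K x V (numpy arrays)."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[d].dot(phi[:, w]))  # p(w | d)
        n_words += len(doc)
    return np.exp(-log_lik / n_words)
```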
I'd be glad if you would point out any mistakes in these.
LLDA requires that labels be assigned to topics explicitly and exactly, but it is very difficult to know how many topics to assign to each label for good estimation.
Moreover, it is natural for some categories to share common topics (e.g. the "baseball" and "soccer" categories).
DP-MRM (Kim+ ICML2012), which I introduced on this blog, is a model that extends LLDA by estimating the label-topic correspondence:
https://shuyo.wordpress.com/2012/07/31/kim-icml12-dirichlet-process-with-mixed-random-measures/
I want to implement it too, but it is very complex…