These are my notes on [Hu+ ACL11] Interactive Topic Modeling (ITM), which I am presenting at the ACL reading group on 9/3 (I ran out of steam partway through, sorry...).
[Addendum]
Added explanations of the Dirichlet tree and of Interactive Adding Constraints and Unassigning (this is the crux of the paper!).
[/Addendum]
What is Interactive Topic Modeling (ITM)?
- Ordinary LDA is unsupervised, so its results basically cannot be controlled
- Even if you want baseball and football in the same topic, when they are not grouped well all you can do is tweak parameters by trial and error, or cluster the topics after the fact
- ITM is a model that lets you add constraints of the form "words A and B should belong to the same topic" to LDA after the fact
Notations
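- Ω_j : a set of words that should belong to the same topic (a constraint)
- Ω_i∩Ω_j=∅ (i≠j)
- C_j = |Ω_j|
- Ω=∪Ω_j
- K : number of topics
- V : vocabulary size
- J : number of constraints
- w_dn : the n-th word of document d
- z_dn : the latent topic of w_dn
- θ_d : the topic distribution of document d
- π_k, φ_kj : topic-word distributions (Dirichlet Tree)
- T_{d,k} : number of words in document d assigned to topic k
- P_{k,w} : number of occurrences of word w assigned to topic k
- P_{k,j} : number of words belonging to constraint j that are assigned to topic k
- The paper also introduces W_{k,j,w} = the number of occurrences of word w, belonging to constraint j, assigned to topic k. Since a word belongs to at most one constraint, P_{k,w} = W_{k,j,w} always holds for a word w in constraint j, so W is unnecessary.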
Dirichlet Tree
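- A Dirichlet tree is a hierarchical version of the Dirichlet distribution
- From the first-level Dirichlet distribution, draw "a constraint or an unconstrained word"
- If a constraint was drawn, additionally draw "a word belonging to that constraint" from the second-level Dirichlet distribution
- When the second-level Dirichlet parameter η has the same value as β, the tree is equivalent to the original Dirichlet distribution (shown later)
- Making η larger than β yields a distribution like (c) in the figure below (a non-hierarchical Dirichlet distribution can only produce distributions like (d))
- (c) and (d) were swapped here; corrected after this was pointed out in the comments (2011/9/7)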
- LDA-DF (Andrzejewski+ ICML09) uses a mixture of multiple Dirichlet trees
- Lots of "trees" make a "forest" (= DF: Dirichlet Forest)
Inference by Collapsed Gibbs Sampling
From the marginal distribution obtained by integrating out θ, π, and φ, we get the full conditional for each z_dn.
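Written out in the notation above, the full conditional should take roughly the following form. This is my reconstruction from the Dirichlet tree structure, not a copy of the paper's equation: α (the document-topic hyperparameter), P_{k,·} = Σ_w P_{k,w}, and z_{-dn} (all assignments except the current one) are symbols assumed here, and every count excludes the token being resampled.

```latex
P(z_{dn}=k \mid z_{-dn}, w) \propto
\begin{cases}
(T_{d,k}+\alpha)\,\dfrac{P_{k,w_{dn}}+\beta}{P_{k,\cdot}+V\beta}
  & w_{dn}\notin\Omega \\[2ex]
(T_{d,k}+\alpha)\,\dfrac{P_{k,j}+C_j\beta}{P_{k,\cdot}+V\beta}
  \cdot\dfrac{P_{k,w_{dn}}+\eta}{P_{k,j}+C_j\eta}
  & w_{dn}\in\Omega_j
\end{cases}
```

The first factor in the constrained case is the probability of reaching constraint node j, and the second is the probability of picking w_dn inside that constraint.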
Estimating the topic-word distribution
Let P(w|z,π,φ)=Multi(ξ_k) and estimate ξ by the mean of the posterior distribution.
- θ_d and perplexity are computed exactly as in vanilla LDA.
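As a rough sketch of the posterior-mean estimate of ξ_k for one topic: the function name and argument layout below are hypothetical (not taken from itm.py), and the constraint node weight C_j·β is the same assumption used in the V=3 proof further down.

```python
def estimate_xi(P_kw, constraints, beta, eta):
    """Posterior-mean word distribution xi_k for a single topic k.

    P_kw: list of length V, P_kw[w] = count of word w assigned to topic k.
    constraints: disjoint lists of word ids (the sets Omega_j).
    """
    V = len(P_kw)
    total = sum(P_kw)  # P_{k,.}
    # Start with the unconstrained form for every word;
    # constrained words are overwritten below.
    xi = [(n + beta) / (total + V * beta) for n in P_kw]
    for omega in constraints:
        C_j = len(omega)
        P_kj = sum(P_kw[w] for w in omega)
        # Probability of reaching the constraint node at the first level.
        node = (P_kj + C_j * beta) / (total + V * beta)
        for w in omega:
            # Probability of word w within the constraint at the second level.
            xi[w] = node * (P_kw[w] + eta) / (P_kj + C_j * eta)
    return xi


counts = [8, 0, 2, 0]  # toy counts for V=4
xi_tree = estimate_xi(counts, [[0, 1]], beta=0.01, eta=100.0)
xi_flat = estimate_xi(counts, [[0, 1]], beta=0.01, eta=0.01)
xi_lda = estimate_xi(counts, [], beta=0.01, eta=0.01)
```

With η=β the constrained estimate collapses to the vanilla LDA estimate, and with a large η the unseen word 1 borrows probability from word 0.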
Interactive Unassignment
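- Consider adding a constraint after training has already run for a while
- baseball and football, which we wanted in the same topic, ended up in different topics!
- Among the terms of the inference update equation, only P_kj is directly affected by the constraints
- If P_kj is recounted for the added or changed constraints, the current state can serve as the initial value for training the new model
- pros
- The training done so far is not wasted
- cons
- When many words have already been assigned to separate topics, escaping from that state by Gibbs sampling is hard given how LDA behaves (it takes a large number of iterations)
- So the problem is solved by partially unassigning the topic assignments of words (that is, z_dn)
- In implementation terms, set z_dn to -1 and decrement the corresponding counters
- The paper proposes four strategies for how much to unassign
- 1. All
- Unassign the topics of all words (that is, redo training from scratch)
- 2. Doc
- Unassign the topics of every word in "documents that contain a word belonging to an added or changed constraint"
- 3. Term
- Unassign the topics of "words belonging to an added or changed constraint"
- 4. None
- Unassign nothing (only recount P_kj)
- The paper claims Doc is the most efficient
- My own experiments gave an impression consistent with that claim (this is not a quantitative evaluation)
- With None of course, and even with Term, the constrained words are not guaranteed to end up in the same topic
- Even if you constrain baseball and football and reset their topics, each is pulled toward the topic that holds many other words it co-occurs with strongly (for example baseball - pitcher, football - goal)
- So resetting the words that tend to co-occur (that is, the words in the same documents) together is more likely to give good results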
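The four unassignment strategies can be sketched as follows. This is an illustrative outline under my own assumptions about the data layout (nested lists for z, T, and P); the helper names are hypothetical and do not mirror itm.py. P_kj is not stored separately here, since it can be recounted from P whenever the constraints change.

```python
def build_counts(docs, z, K, V):
    """Build T[d][k] and P[k][w] from the current assignments z."""
    T = [[0] * K for _ in docs]
    P = [[0] * V for _ in range(K)]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            if k >= 0:
                T[d][k] += 1
                P[k][w] += 1
    return T, P


def unassign(docs, z, T, P, changed_words, method):
    """Reset topic assignments after constraints are added or changed.

    changed_words: set of word ids in the added or changed constraints.
    method: "all", "doc", "term" or "none".
    """
    if method not in ("all", "doc", "term", "none"):
        raise ValueError("unknown unassign method: %r" % method)
    for d, doc in enumerate(docs):
        doc_hit = any(w in changed_words for w in doc)
        for n, w in enumerate(doc):
            reset = (method == "all"
                     or (method == "doc" and doc_hit)
                     or (method == "term" and w in changed_words))
            k = z[d][n]
            if reset and k >= 0:
                # Mark the token as unassigned and decrement its counters.
                T[d][k] -= 1
                P[k][w] -= 1
                z[d][n] = -1


def recount_constraint(P, omega):
    """Recount P_kj for one constraint: per-topic totals over its words."""
    return [sum(row[w] for w in omega) for row in P]


docs = [[0, 1, 2], [1, 3], [3, 4]]
z = [[0, 0, 1], [1, 1], [0, 1]]
T, P = build_counts(docs, z, K=2, V=5)
# Add a constraint on words 0 and 3, then unassign with the Doc strategy.
unassign(docs, z, T, P, changed_words={0, 3}, method="doc")
P_kj = recount_constraint(P, [0, 3])
```

In this toy corpus every document contains word 0 or word 3, so Doc resets everything; with `method="term"` only the tokens of words 0 and 3 would be reset. Tokens left at -1 would then be sampled fresh on the next Gibbs sweep.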
Observations on the model
- With no constraints, it is equivalent to vanilla LDA
- When β=η, it is equivalent to vanilla LDA (even with constraints)
- As a model, it is equivalent to LDA-DF without Cannot-Link
- So Interactive Unassignment is the real crux of ITM
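Proof that "when β=η, it is equivalent to vanilla LDA"
For simplicity, take V=3 with a constraint on w_1 and w_2.
For P(w_1, w_2, w_3|z) = Multi(ξ_1, ξ_2, ξ_3), the prior factorizes as
P(ξ) = P(ξ_1, ξ_2)・P(ξ_3|ξ_1, ξ_2).
Here P(ξ_1, ξ_2)=Dir(η, η) is the distribution of the proportions of ξ_1 and ξ_2 within the constraint, that is, of (ξ_1, ξ_2)/(ξ_1+ξ_2), and P(ξ_3|ξ_1, ξ_2) = Beta(β, 2β) (the same thing as Dir(β, 2β)) is the first-level split between w_3 and the constraint.
If η=β, P(ξ) coincides with Dir(β, β, β) factorized into this product: for (ξ_1, ξ_2, ξ_3) ~ Dir(β, β, β), ξ_3 follows Beta(β, 2β) and, independently of it, (ξ_1, ξ_2)/(ξ_1+ξ_2) follows Dir(β, β). ■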
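The V=3 argument can be checked numerically by comparing log densities at a single point. This is a small sanity check rather than a proof; the function names are hypothetical, and the only step not spelled out in the text above is the change of variables from (ξ_3, ξ_1/(ξ_1+ξ_2)) to (ξ_1, ξ_2), which contributes a factor 1/(ξ_1+ξ_2).

```python
import math


def log_dirichlet_pdf(x, alphas):
    """Log density of Dir(alphas) at the point x (x sums to 1)."""
    log_norm = math.lgamma(sum(alphas)) - sum(math.lgamma(a) for a in alphas)
    return log_norm + sum((a - 1.0) * math.log(v) for a, v in zip(alphas, x))


def log_tree_pdf(xi, beta, eta):
    """Log density of (xi_1, xi_2, xi_3) under the V=3 tree, w_1 and w_2 constrained."""
    xi1, xi2, xi3 = xi
    mass = xi1 + xi2  # probability of reaching the constraint node
    # First level: split between w_3 (weight beta) and the constraint (weight 2*beta).
    top = log_dirichlet_pdf([xi3, mass], [beta, 2.0 * beta])
    # Second level: proportions inside the constraint follow Dir(eta, eta).
    inner = log_dirichlet_pdf([xi1 / mass, xi2 / mass], [eta, eta])
    # Jacobian of the change of variables to (xi_1, xi_2) is 1/mass.
    return top + inner - math.log(mass)


point = (0.2, 0.3, 0.5)
beta = 0.7
flat = log_dirichlet_pdf(point, [beta, beta, beta])
tree_equal = log_tree_pdf(point, beta, eta=beta)
tree_large_eta = log_tree_pdf(point, beta, eta=5.0)
```

With η=β the two log densities agree up to floating-point error, while a larger η gives a different density at the same point.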
I implemented it
- https://github.com/shuyo/iir/blob/master/lda/itm.py
- https://github.com/shuyo/iir/blob/master/lda/vocabulary.py
- Python + numpy + nltk
Usage: itm.py [options]
Options:
  -h, --help            show this help message and exit
  -m MODEL              model filename
  -f FILENAME           corpus filename
  -b CORPUS             using range of Brown corpus' files(start:end)
  --alpha=ALPHA         parameter alpha
  --beta=BETA           parameter beta
  --eta=ETA             parameter eta
  -k K                  number of topics
  -i ITERATION          iteration count
  --seed=SEED           random seed
  --df=DF               threshold of document freaquency to cut words
  -c CONSTRAINT         add constraint (wordlist which should belong to the same topic)
  -u UNASSIGN, --unassign=UNASSIGN
                        unassign method (all/doc/term/none)
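Summary
- Quite interesting, and it looks usable
- I want to run many more experiments
- The implementation is public, so try it out if you are interested
- η (=100) feels too large (red in the graph)
- Words that are spread over several topics, such as apple or service (as in worship), all end up tied to the words of the same constraint
- Maybe something like η=2 (black in the graph) is enough? Convergence is slow too?
- It is only a heuristic, but η could start large and be decreased gradually
- Using a different η for each constraint might also work
- I would like to check how the model behaves when a polysemous word is added to a constraint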
References
- [Hu+ ACL11] Interactive Topic Modeling
- [Blei+ 2003] Latent Dirichlet Allocation
- [Andrzejewski+ ICML09] Incorporating Domain Knowledge into Topic Modeling via Dirichlet Forest Priors
- [小林+ 2011] 論理制約付きトピックモデルのためのディリクレ森事前分布構成法 (Kobayashi+ 2011, a method for constructing Dirichlet forest priors for topic models with logical constraints)