I tagged along to the 2nd Statistical Machine Learning Seminar, held at the Institute of Statistical Mathematics.
This session was a nonparametric Bayes special, with Yee Whye Teh speaking on the sequence memoizer and Mochihashi-san on unsupervised and semi-supervised word segmentation. So first, let me write up the sequence memoizer as far as I understand it.
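To begin with, I'll assume the Pitman-Yor process is known. If it isn't, reading 「独断と偏見によるノンパラ入門」 ("An Opinionated and Biased Introduction to Nonparametrics") should give you the gist... I think (wry smile).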
Roughly, and only as far as needed here: take a word distribution G (with infinite support, i.e. the die with an "other" face from that introduction). Writing G' ~ PY(θ,d,G) gives a new word distribution G' that takes the original distribution G into account. That PY(θ,d,G) is the Pitman-Yor process. θ is the concentration parameter, d is the discount parameter, and G is the base measure.
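To make the die with an "other" face a bit more concrete, here is a minimal sketch of drawing words from G' ~ PY(θ,d,G) through the Chinese restaurant process representation. The function name `sample_pitman_yor` and the `base_sampler` callback (which draws one word from G) are my own illustrative names, not from the talk or the papers.

```python
import random

def sample_pitman_yor(theta, d, base_sampler, n, rng=None):
    """Draw n words from G' ~ PY(theta, d, G) via the Chinese restaurant process.

    base_sampler(rng) draws one word from the base measure G (hypothetical helper).
    Assumes 0 <= d < 1 and theta > -d.
    """
    rng = rng or random.Random(0)
    tables = []  # tables[k] = [word served at table k, number of customers]
    draws = []
    for i in range(n):  # i customers are already seated
        # "Other" face: open a new table with probability (theta + d*K) / (theta + i)
        p_new = (theta + d * len(tables)) / (theta + i) if i > 0 else 1.0
        if rng.random() < p_new:
            tables.append([base_sampler(rng), 1])
            draws.append(tables[-1][0])
        else:
            # Join existing table k with probability proportional to (count_k - d)
            weights = [count - d for _, count in tables]
            k = rng.choices(range(len(tables)), weights=weights)[0]
            tables[k][1] += 1
            draws.append(tables[k][0])
    return draws, tables

# Toy base measure G: uniform over three words (an assumption for illustration)
draws, tables = sample_pitman_yor(1.0, 0.5, lambda r: r.choice(["a", "b", "c"]), 100)
print(len(tables), "tables for", len(draws), "draws")
```

A larger d or θ makes the "other" face come up more often, so more draws fall through to the base measure G.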
For example, "rabbit" appears 51 times in Alice in Wonderland. Words such as "the" and "white" come before "rabbit", and "white rabbit" appears 22 times.
Now, the words that follow "white rabbit", such as "with" and "read", all show up in the distribution of words that follow "rabbit". Conversely, a word that follows "rabbit" does not necessarily follow "white rabbit".
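This subset relation is easy to check by counting. The sketch below uses a made-up toy sentence rather than the actual Alice text, and `follower_counts` is a helper name I introduced for illustration.

```python
from collections import Counter, defaultdict

def follower_counts(words, max_context=2):
    """Map each context (tuple of up to max_context preceding words) to a Counter of next words."""
    counts = defaultdict(Counter)
    for i, w in enumerate(words):
        for n in range(1, max_context + 1):
            if i - n >= 0:
                counts[tuple(words[i - n:i])][w] += 1
    return counts

# Toy text, invented for this example (not the real Alice in Wonderland)
toy = ("the white rabbit read the note and the rabbit ran off "
       "with the white rabbit with a fan").split()
counts = follower_counts(toy)
after_rabbit = counts[("rabbit",)]
after_white_rabbit = counts[("white", "rabbit")]

print(sorted(after_white_rabbit))                 # ['read', 'with']
print(sorted(after_rabbit))                       # ['ran', 'read', 'with']
print(set(after_white_rabbit) <= set(after_rabbit))  # True
```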
So if we build the distribution G_{white rabbit} of words following "white rabbit" from the distribution G_{rabbit} of words following "rabbit", as G_{white rabbit} ~ PY(θ,d,G_{rabbit}), we get a very accurate distribution.
In the same way, the distributions for "the rabbit" and "the white rabbit" can be built as G_{the rabbit} ~ PY(θ,d,G_{rabbit}) and G_{the white rabbit} ~ PY(θ,d,G_{white rabbit}). As for G_{rabbit} itself, it is built from a prior word distribution H as G_{rabbit} ~ PY(θ,d,H).
If we regard G_{rabbit} as the parent of G_{white rabbit}, G_{white rabbit} as the parent of G_{the white rabbit}, and so on, we get a tree with H at the root. Building this tree down to depth 3, say, gives a 4-gram hierarchical Pitman-Yor language model: G_{three words} gives the distribution of the fourth word that comes next, hence 4-gram.
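The parent relation is just "drop the earliest word of the context". A minimal sketch, where the function name `ancestors` and the use of the empty tuple to stand for the root H are my own conventions:

```python
def ancestors(context):
    """Yield the context, then its parent, and so on, ending with () which stands for the root H."""
    context = tuple(context)
    while True:
        yield context
        if not context:
            return
        context = context[1:]  # the parent drops the earliest (leftmost) word

for ctx in ancestors(("the", "white", "rabbit")):
    print(ctx if ctx else "H (root)")
```

A depth-3 context such as ("the", "white", "rabbit") therefore sits three steps below H, which is why stopping at depth 3 gives a 4-gram model.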
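A phrase that is not in the training data corresponds to the "other" face of the die. When that face comes up, a word is pulled from the parent word distribution. The parent distribution generally has wider support than the child's, so this lets the model generate phrases that never appear in the training data.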
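Of course, the parent word distribution has an "other" face too, and when that comes up we climb further, to the parent's parent. In the end the prior word distribution H is waiting at the root, so there is no danger of going back forever.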
This behavior corresponds to 4-gram smoothing, where the 3-gram/2-gram/1-gram probabilities are added in with graded weights. In fact, the hierarchical Pitman-Yor language model is exactly a Bayesian formulation of Kneser-Ney smoothing.
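Written out, the back-off through the "other" face gives the following predictive probability. The symbols are notation I am introducing here, following the Chinese restaurant representation in [Teh 2006]: c_{uw} is the number of customers eating word w in context u, t_{uw} is the number of tables serving w, a dot means a sum over words, and π(u) is the parent context. The papers let θ and d depend on the depth of u; I keep the single θ and d used above.

```latex
P(w \mid u) \;=\; \frac{c_{uw} - d\,t_{uw}}{\theta + c_{u\cdot}}
\;+\; \frac{\theta + d\,t_{u\cdot}}{\theta + c_{u\cdot}}\; P\bigl(w \mid \pi(u)\bigr)
```

At the root, P(w | π(u)) is replaced by H(w), and for a context that was never seen (c_{u·} = 0) the prediction is simply the parent's. With θ = 0 and every t_{uw} fixed to 1 whenever c_{uw} > 0, this has the form of interpolated Kneser-Ney.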
The sequence memoizer takes the hierarchy of this hierarchical Pitman-Yor language model to infinite depth, giving an ∞-gram language model.
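Built naively, this would mean constructing an infinitely deep suffix "trie". But by marginalizing out every G_{hoge} that has only one child, the structure takes the same shape as a suffix "tree", and the number of nodes can be shown to be bounded by twice the length of the text.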
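To get a feel for the trie versus tree difference, here is a rough sketch that builds a naive suffix trie for a short string and counts how many nodes survive once every non-root node with exactly one child is spliced out. This is a plain string illustration with helper names of my own, not the sequence memoizer's actual data structure, and the naive construction is quadratic, so it is only meant for tiny inputs.

```python
def build_suffix_trie(s):
    """Naive suffix trie as nested dicts; s should end with a unique terminator."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    """Number of trie nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in node.values())

def count_compressed_nodes(node, is_root=True):
    """Nodes left after splicing out every non-root node with exactly one child."""
    keep = 1 if (is_root or len(node) != 1) else 0
    return keep + sum(count_compressed_nodes(c, False) for c in node.values())

s = "abracadabra$"  # "$" is a terminator that appears nowhere else
trie = build_suffix_trie(s)
print("trie nodes:", count_nodes(trie))
print("tree nodes:", count_compressed_nodes(trie), "<= 2 *", len(s))
```

Splicing out the single-child nodes is the structural counterpart of marginalizing out the G's that have only one child.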
Marginalizing here means: given G_{white rabbit} ~ PY(θ,d,G_{rabbit}) and G_{the white rabbit} ~ PY(θ,d,G_{white rabbit}), skip the intermediate G_{white rabbit} and build G_{the white rabbit} directly from G_{rabbit}. For that, we need to be able to compute a "Pitman-Yor of a Pitman-Yor".
But it turns out that if we put a certain constraint on the parameters, a "Pitman-Yor of a Pitman-Yor" is again a Pitman-Yor. What a great idea.
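As far as I understand [Wood+ 2009], the constraint is to fix the concentration parameter to θ = 0, and the discounts then simply multiply. In the PY(θ,d,G) notation used here, with d_1 and d_2 as illustrative names for the two discounts:

```latex
G_1 \mid G_0 \sim \mathrm{PY}(0, d_1, G_0), \quad
G_2 \mid G_1 \sim \mathrm{PY}(0, d_2, G_1)
\;\;\Longrightarrow\;\;
G_2 \mid G_0 \sim \mathrm{PY}(0, d_1 d_2, G_0)
```

So a chain of single-child nodes collapses into one Pitman-Yor step whose discount is the product of the discounts along the chain.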
Built this way, the sequence memoizer achieves the best accuracy of any language model at the time of writing.
A highly accurate language model is one that accurately predicts the probability of the next word, which means it should be great for compression. And indeed, compression based on the sequence memoizer can be tried out on the demo site.
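The link between prediction and compression is that an ideal entropy coder spends about -log2(p) bits on a symbol that the model predicted with probability p. A tiny sketch with made-up probabilities (the function name is mine, and a real arithmetic coder adds a small overhead on top of this):

```python
import math

def ideal_code_length_bits(probs):
    """Total ideal code length in bits when each symbol is coded with its predicted probability."""
    return sum(-math.log2(p) for p in probs)

# Probabilities two hypothetical models assigned to the same three symbols
vague_model = [0.5, 0.25, 0.25]
sharp_model = [0.9, 0.8, 0.9]
print(ideal_code_length_bits(vague_model))  # 5.0
print(ideal_code_length_bits(sharp_model) < ideal_code_length_bits(vague_model))  # True
```

The better the model predicts what actually comes next, the fewer bits the output needs.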
I compressed Alice in Wonderland with bzip2 and with that site, and compared the file sizes.
Method | Size (bytes) |
Original | 145,283 |
bzip2-9 | 40,777 |
deplump | 36,506 |
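For a quick sense of scale, the sizes above can be turned into ratios. This is only arithmetic on the three numbers in the table, assuming they are byte counts:

```python
sizes = {"original": 145283, "bzip2-9": 40777, "deplump": 36506}

for name, size in sizes.items():
    ratio = size / sizes["original"]
    bits_per_byte = 8 * ratio  # output bits spent per input byte
    print(f"{name:10s} {ratio:6.1%} of original, {bits_per_byte:.2f} bits per input byte")

saving = 1 - sizes["deplump"] / sizes["bzip2-9"]
print(f"deplump output is {saving:.1%} smaller than bzip2-9 output")
```

That works out to roughly 28.1% of the original for bzip2-9 and 25.1% for deplump, so deplump's output is about 10.5% smaller than bzip2's.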
Hmm, I really haven't managed to explain this well at all. It's hard.
As always, corrections and comments are of course welcome.
References
- [Teh 2006] A Hierarchical Bayesian Language Model based on Pitman-Yor Processes
- [Teh 2006] A Bayesian Interpretation of Interpolated Kneser-Ney
- [Wood+ 2009] A Stochastic Memoizer for Sequence Data
- [Gasthaus+ 2010] Lossless Compression based on Sequence Memoizer
- [Gasthaus+ 2010] Improvements to the Sequence Memoizer