Text Classification Using Naive Bayes
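Until now I have been reading PRML and implementing what I read, but from chapter 10 onward the material has become too hard for me to make headway, so I want to turn to something more concrete for a while. As an application area for machine learning, images are more fun because you can look at the results, but for the time being I will take up natural language processing. As the first application, I would like to try out methods for text classification (also called text categorization), a very important meeting point between machine learning and natural language processing. Text classification is also called document classification; "text" and "document" mean the same thing here. Since this is the basics, I have written it up a little more carefully than usual, both to organize my own knowledge and to introduce the topic to newcomers.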
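What is text classification?
Text classification is the task of automatically assigning given documents (Web pages, for instance) to one of several categories (classes) fixed in advance. Depending on the kind of text, it has a wide range of applications. Features that are already in practical use and that we rely on every day include: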
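- Automatically classifying e-mail into the categories "spam" and "everything else", and throwing the "spam" into the trash (spam filters)
- Automatically classifying Web pages into categories such as "politics and economy", "science and scholarship", "computers and IT", and "games and anime" (Hatena Bookmark)
- Automatically classifying news articles into "interesting" and "not interesting", and recommending only the "interesting" ones (information recommendation and information filtering)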
In these examples the e-mail, the Web pages, and the news articles are the texts. Take Hatena Bookmark, which I use all the time: no human reads each Web page and decides "this page is computers and IT". A classification program built with machine learning (called a classifier) does the classification automatically.
With huge numbers of Web pages appearing day after day, there is no way that could be done by hand, right? (Yahoo! has long done it by hand, though I wonder how they handle it now.)
Supervised learning
Here is how it works. First, a human acts as the teacher and trains the classifier, like this:

    Web page 1 is "IT"
    Web page 2 is "science"
    Web page 3 is "IT"
    Web page 4 is "politics"
    Web page 5 is "games"
    ...

Data made of such (text, correct category given by a human) pairs is called training data. From this training data the classifier automatically learns the characteristics of the documents in each category. For example:
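- A text containing words such as "iPhone", "Apple", or "Twitter" has a high probability of belonging to the "IT" category
- A text containing words such as "Democratic Party of Japan" or "Naoto Kan" has a high probability of belonging to the "politics" category
- A text containing words such as "research", "JAXA", or "gene" has a high probability of belonging to the "science" category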
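and so on. If you then take a new document whose category is unknown, say one containing "Apple" and "iPhone", and ask the trained classifier for its category, it answers that the document is most probably "IT". In general, the more training data there is, the more accurately the classifier can classify text. Machine learning methods like this, in which humans supply the correct categories as training data, are called supervised learning.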
Bag-of-words
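A text is usually given as a collection of words. Because it is a collection, word order is ignored: where a word appears in the document is not taken into account, only how often it appears (so strictly speaking it is a multiset). This representation of text is called bag-of-words. The image is of words stuffed into a bag in no particular order. For example, the document

    Text classification is the task of automatic classification of a given text into a category given in advance.

is represented as a bag-of-words like this:

    text text category task automatic classification classification

It depends on the task, but I think it is common to extract only the nouns with morphological analysis (2009/4/15). As an aside, the bag-of-visual words that I covered in "Similar image search using Visual Words" (2010/2/27) is the image version of bag-of-words. Like bag-of-words, bag-of-visual words ignores where in the image each "word" (a centroid of local feature descriptors) is located. Simplifications like these are what keep the learning algorithms simple.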
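The bag-of-words idea above can be sketched in a few lines of Python. This is only an illustration: the token list is a hand-made English stand-in for the example sentence (the tokenization step itself is assumed to have been done already), and `bag_of_words` is a name made up here.

```python
from collections import Counter

def bag_of_words(tokens):
    """Return a word -> count mapping; the order of the tokens is discarded."""
    return Counter(tokens)

# Hypothetical, already-tokenized stand-in for the example sentence
tokens = ["text", "classification", "task", "automatic",
          "classification", "text", "category"]
bow = bag_of_words(tokens)
print(bow["text"], bow["classification"], bow["task"])

# Word order does not matter: the reversed document gives the same bag
same_bag = bag_of_words(reversed(tokens)) == bow
print(same_bag)
```

A `Counter` keeps the counts, which is what the multinomial naive Bayes model later in this article needs; a plain `set` would throw them away.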
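Methods for text classification
Text classification has been studied very extensively, and there are a great many algorithms for it. Off the top of my head: naive Bayes, decision trees, the Rocchio classifier, k-nearest neighbors, logistic regression, neural networks, support vector machines, boosting, and so on. Their approaches differ considerably. On top of that, many methods have been proposed for converting text into vectors (TF-IDF, for example) and for dimensionality reduction (LSI, for example), and once you consider all the combinations you end up wondering which one you are actually supposed to use. Support vector machines and boosting are generally said to classify more accurately than the other methods. I will try them out for real in due course. This time I take up naive Bayes, which is widely used, easy to implement, and fast. It is also often used as a baseline in accuracy evaluations.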
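Naive Bayes
The central formula of naive Bayes is the following application of Bayes' theorem:

$$P(cat \mid doc) = \frac{P(cat)\,P(doc \mid cat)}{P(doc)} \propto P(cat)\,P(doc \mid cat)$$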
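The posterior probability P(cat|doc) is the probability that the category is cat given the document doc. An unseen document whose category we want to predict is assigned to the category with the highest posterior probability (MAP estimation). To compute this probability we need the prior probability P(cat) and the likelihood P(doc|cat) on the right-hand side. P(doc) is the same for every category, so it can be ignored when comparing categories. The posterior P(cat|doc) and the likelihood P(doc|cat) are easy to confuse, but they are different things. I remember struggling quite a bit to understand the difference...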
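As a small illustration of why P(doc) can be dropped, the sketch below picks the MAP category from the unnormalized products P(cat)P(doc|cat) and checks that dividing by P(doc) does not change the winner. All numbers and category names are hypothetical and chosen only for the example.

```python
# Hypothetical values, for illustration only
prior = {"IT": 0.5, "science": 0.3, "politics": 0.2}           # P(cat)
likelihood = {"IT": 1e-6, "science": 4e-7, "politics": 1e-8}    # P(doc|cat)

# Unnormalized posterior: P(cat) * P(doc|cat)
unnormalized = {cat: prior[cat] * likelihood[cat] for cat in prior}
cat_map = max(unnormalized, key=unnormalized.get)

# P(doc) is the same for every category, so dividing by it rescales
# every score by the same factor and cannot change the argmax.
p_doc = sum(unnormalized.values())
posterior = {cat: score / p_doc for cat, score in unnormalized.items()}
same_winner = max(posterior, key=posterior.get) == cat_map

print(cat_map, same_winner)
```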
First, P(cat). This one is easy: just compute the fraction of the training documents that belong to each category. For example:

    Training data: 100 documents
    IT        50 documents → P(cat=IT) = 50 / 100 = 0.5
    science   30 documents → P(cat=science) = 30 / 100 = 0.3
    politics  20 documents → P(cat=politics) = 20 / 100 = 0.2
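P(doc|cat) is a little more complicated. It is the probability that the document doc is generated given the category cat. Here the document doc is represented as a bag-of-words, a collection of words [word_1,word_2,...,word_k], and if we assume that the words are independent of one another it can be computed as follows:

$$P(doc \mid cat) = P(word_1 \wedge \dots \wedge word_k \mid cat) = \prod_{i=1}^{k} P(word_i \mid cat)$$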
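In the formula above, the step from the second expression to the third holds only if we assume that the word occurrences are independent (given the category). That a joint probability can be written as the product of the individual probabilities is precisely the definition of independence in probability theory. In reality, word occurrences are not independent. For example, "artificial" and "intelligence" tend to co-occur, as do "machine" and "learning". Ignoring this, forcibly assuming that word occurrences are independent, and simplifying the probability of a document into a product of word probabilities is exactly what makes naive Bayes "naive". A variant that does model dependencies between words, TAN (Tree-Augmented Naive Bayes), has been proposed, but judging from how little it has caught on, it may be a case of much effort for little gain.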
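Next we need the probability P(word_i|cat). This is called the conditional probability of the word, and it expresses how readily the word appears within the category. It is easy to compute: divide the number of times word_i appears in category cat in the training data by the total number of words in category cat. Writing T(cat,word) for the number of times word appears in category cat, and V for the set of all words in the training data (the vocabulary), we get

$$P(word_i \mid cat) = \frac{T(cat, word_i)}{\sum_{word' \in V} T(cat, word')}$$

The denominator sums over every word in V, but in practice restricting the sum to the words that actually appear in category cat gives the same result, because T(cat,word)=0 for any word that never appeared in that category.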
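The two counting estimates described so far, P(cat) and the unsmoothed P(word|cat), might look like this in code. This is a minimal sketch on made-up toy data; the names `train`, `doc_count`, `T`, `prior`, and `word_prob` are assumptions for the example and are not taken from the implementation later in the article.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (category, list of words)
train = [
    ("IT", ["iPhone", "Apple", "iPhone"]),
    ("IT", ["Twitter", "Apple"]),
    ("science", ["JAXA", "research"]),
]

doc_count = Counter(cat for cat, _ in train)   # documents per category
T = defaultdict(Counter)                       # T[cat][word]: occurrence counts
for cat, words in train:
    T[cat].update(words)

def prior(cat):
    """P(cat): share of the training documents that belong to cat."""
    return doc_count[cat] / sum(doc_count.values())

def word_prob(word, cat):
    """Unsmoothed P(word|cat) = T(cat, word) / total word count in cat."""
    return T[cat][word] / sum(T[cat].values())

print(prior("IT"))                  # 2 of the 3 documents
print(word_prob("iPhone", "IT"))    # 2 of the 5 words in IT
print(word_prob("JAXA", "IT"))      # never seen in IT
```

The last call returns exactly zero, which is the situation the zero-frequency discussion in this article deals with.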
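Logarithms
Putting the results so far together, the category cat_map that a document is finally assigned to is

$$cat_{map} = \arg\max_{cat} P(cat \mid doc) = \arg\max_{cat} P(cat) \prod_{i=1}^{k} P(word_i \mid cat)$$

Here argmax f(x) means "return the x for which f(x) is largest". P(word|cat) is a very small number, and a document contains many words, so the product can underflow. We therefore take logarithms, which turns the multiplication into addition:

$$cat_{map} = \arg\max_{cat} \left[ \log P(cat) + \sum_{i=1}^{k} \log P(word_i \mid cat) \right]$$

Taking the logarithm does not change the ordering of the posterior probabilities (so the resulting cat_map does not change), so this causes no problem.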
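The underflow is easy to reproduce. The sketch below uses a made-up per-word probability and document length; it only demonstrates the floating-point behaviour and is not part of the classifier.

```python
import math

p = 1e-5         # hypothetical P(word|cat), taken to be the same for every word
n_words = 100    # hypothetical document length

# Direct product: 1e-500 is far below the smallest positive double,
# so the result collapses to exactly 0.0.
product = 1.0
for _ in range(n_words):
    product *= p
print(product)

# Sum of logarithms: an ordinary, perfectly representable negative number.
log_sum = n_words * math.log(p)
print(log_sum)
```

Once the product has collapsed to zero, every category gets the same score and the comparison is meaningless, whereas the log scores can still be compared.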
The zero-frequency problem
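P(doc|cat) is obtained as a product of the conditional word probabilities P(word|cat), and besides underflow this causes one more big problem. When predicting the category of an unseen document, if the document contains even one word that never occurred in a category's training data (in particular, any word that is not in the training vocabulary at all), the conditional probability P(word|cat) of that word is 0, and so P(doc|cat), being a product of those probabilities, is also 0 (with logarithms we get log 0, which cannot be computed). In other words, the probability that the new document is generated becomes 0.
For example, suppose a document contains words such as iPhone and Apple, and we are thinking "right, this was very likely generated by the IT category". If it also contains the new word iPad, which did not occur during training, then P(doc|cat) = 0 and the probability that the document was generated by the IT category becomes 0. But iPhone and Apple are in there, so IT should be a likely category. Something is wrong! This is known as the zero-frequency problem. It can be mitigated by a technique called smoothing. A commonly used variant is Laplace smoothing, which adds 1 to the occurrence count of every word:

$$P(word_i \mid cat) = \frac{T(cat, word_i) + 1}{\sum_{word' \in V} \left( T(cat, word') + 1 \right)} = \frac{T(cat, word_i) + 1}{\sum_{word' \in V} T(cat, word') + |V|}$$

When a new word appears the probability becomes low, but it never becomes 0.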
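The effect of add-one smoothing can be seen on a tiny example. The counts and the vocabulary below are invented for illustration, and `unsmoothed` and `laplace` are hypothetical helper names; the sketch compares the product of word probabilities for a document that contains one unseen word.

```python
import math
from collections import Counter

# Hypothetical counts T("IT", word) and training vocabulary V
counts = Counter({"iPhone": 3, "Apple": 2})
vocabulary = {"iPhone", "Apple", "JAXA", "research"}
total = sum(counts.values())

def unsmoothed(word):
    """T(cat, word) / total: zero for any word not seen in the category."""
    return counts[word] / total

def laplace(word):
    """(T(cat, word) + 1) / (total + |V|): never zero."""
    return (counts[word] + 1) / (total + len(vocabulary))

doc = ["iPhone", "Apple", "iPad"]   # "iPad" never appeared in training

p_unsmoothed = math.prod(unsmoothed(w) for w in doc)
p_laplace = math.prod(laplace(w) for w in doc)
print(p_unsmoothed)   # one zero factor wipes out the whole product
print(p_laplace)      # small, but still positive
```

`math.prod` needs Python 3.8 or later; on older versions a simple loop does the same job.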
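Implementation in Python
A straightforward Python implementation of the above looks like this. It takes logarithms and uses Laplace smoothing. The denominator of the conditional word probability P(word|cat) does not depend on the word passed in, so it is computed once for each category at training time. Recomputing it every time a conditional word probability is requested would make things extremely slow.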
    import math
    from collections import defaultdict


    class NaiveBayes:
        """Multinomial Naive Bayes"""

        def __init__(self):
            self.categories = set()    # set of categories
            self.vocabularies = set()  # set of vocabulary words
            self.wordcount = {}        # wordcount[cat][word]: occurrences of word in cat
            self.catcount = {}         # catcount[cat]: number of documents in cat
            self.denominator = {}      # denominator[cat]: denominator of P(word|cat)

        def train(self, data):
            """Train the naive Bayes classifier."""
            # Extract the categories from the documents and initialize the dictionaries
            for d in data:
                cat = d[0]
                self.categories.add(cat)
            for cat in self.categories:
                self.wordcount[cat] = defaultdict(int)
                self.catcount[cat] = 0
            # Count categories and words over the documents
            for d in data:
                cat, doc = d[0], d[1:]
                self.catcount[cat] += 1
                for word in doc:
                    self.vocabularies.add(word)
                    self.wordcount[cat][word] += 1
            # Precompute the denominator of the conditional word probability (for speed)
            for cat in self.categories:
                self.denominator[cat] = sum(self.wordcount[cat].values()) + len(self.vocabularies)

        def classify(self, doc):
            """Return the category with the largest log score log(P(cat) P(doc|cat))."""
            best = None
            max_score = float("-inf")
            for cat in self.catcount:
                p = self.score(doc, cat)
                if p > max_score:
                    max_score = p
                    best = cat
            return best

        def wordProb(self, word, cat):
            """Return the conditional word probability P(word|cat)."""
            # Laplace smoothing is applied.
            # wordcount[cat] is a defaultdict(int), so a word that never occurred
            # in the category yields the default value 0.
            # The denominator was precomputed at the end of train().
            return float(self.wordcount[cat][word] + 1) / float(self.denominator[cat])

        def score(self, doc, cat):
            """Return log P(cat|doc) up to the constant log P(doc), which is dropped."""
            total = sum(self.catcount.values())                    # total number of documents
            score = math.log(float(self.catcount[cat]) / total)    # log P(cat)
            for word in doc:
                # Taking logs turns the product into a sum
                score += math.log(self.wordProb(word, cat))        # log P(word|cat)
            return score

        def __str__(self):
            total = sum(self.catcount.values())                    # total number of documents
            return "documents: %d, vocabularies: %d, categories: %d" % (
                total, len(self.vocabularies), len(self.categories))


    if __name__ == "__main__":
        # Example from Introduction to Information Retrieval 13.2
        data = [["yes", "Chinese", "Beijing", "Chinese"],
                ["yes", "Chinese", "Chinese", "Shanghai"],
                ["yes", "Chinese", "Macao"],
                ["no", "Tokyo", "Japan", "Chinese"]]

        # Train the naive Bayes classifier
        nb = NaiveBayes()
        nb.train(data)
        print(nb)
        print("P(Chinese|yes) =", nb.wordProb("Chinese", "yes"))
        print("P(Tokyo|yes) =", nb.wordProb("Tokyo", "yes"))
        print("P(Japan|yes) =", nb.wordProb("Japan", "yes"))
        print("P(Chinese|no) =", nb.wordProb("Chinese", "no"))
        print("P(Tokyo|no) =", nb.wordProb("Tokyo", "no"))
        print("P(Japan|no) =", nb.wordProb("Japan", "no"))

        # Predict the category of the test data
        test = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
        print("log P(yes|test) =", nb.score(test, "yes"))
        print("log P(no|test) =", nb.score(test, "no"))
        print(nb.classify(test))
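The program above uses the example from Table 13.1 of Introduction to Information Retrieval (IIR). The training data is passed as a list of lists. Each inner list is one training example, and element 0 of the list is its category (a commonly used format). For instance, the first training example says that the document with the bag-of-words representation [Chinese, Beijing, Chinese] belongs to category yes. The program trains a naive Bayes classifier on the four training examples and then uses it to predict the category of the document [Chinese, Chinese, Chinese, Tokyo, Japan]. As in IIR, the document is classified as yes. The output is shown below, with floating-point values rounded to 12 significant digits. The values labelled log P(yes|test) and log P(no|test) are the scores log(P(cat)P(doc|cat)), that is, log posteriors without the normalizing term P(doc).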
    documents: 4, vocabularies: 6, categories: 2
    P(Chinese|yes) = 0.428571428571
    P(Tokyo|yes) = 0.0714285714286
    P(Japan|yes) = 0.0714285714286
    P(Chinese|no) = 0.222222222222
    P(Tokyo|no) = 0.222222222222
    P(Japan|no) = 0.222222222222
    log P(yes|test) = -8.10769031284
    log P(no|test) = -8.906681345
    yes
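These numbers can be checked by hand from the training data: category yes contains 8 word tokens (Chinese occurs 5 times), category no contains 3, and the vocabulary has 6 words. The sketch below redoes that arithmetic with exact fractions. It is only a cross-check under those counts, not part of the classifier, and the variable names are made up here.

```python
import math
from fractions import Fraction

# Token counts read off the four training documents
yes_tokens, no_tokens, vocab_size = 8, 3, 6

# Laplace-smoothed conditional probabilities
p_chinese_yes = Fraction(5 + 1, yes_tokens + vocab_size)   # Chinese occurs 5 times in yes
p_tokyo_yes = Fraction(0 + 1, yes_tokens + vocab_size)     # Tokyo (and Japan) never occur in yes
p_any_no = Fraction(1 + 1, no_tokens + vocab_size)         # Chinese, Tokyo, Japan: once each in no

# Test document: Chinese x3, Tokyo, Japan; priors are 3/4 (yes) and 1/4 (no)
log_yes = (math.log(3 / 4)
           + 3 * math.log(float(p_chinese_yes))
           + 2 * math.log(float(p_tokyo_yes)))
log_no = math.log(1 / 4) + 5 * math.log(float(p_any_no))

print(p_chinese_yes, p_tokyo_yes, p_any_no)
print(log_yes, log_no)
```

The fractions reduce to 3/7, 1/14, and 2/9, which match the decimal values in the output above.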
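Summary
There are two representative models of naive Bayes: the multinomial model and the Bernoulli model. The one implemented here is the multinomial model. My impression is that the multinomial model is the more widely used of the two, and I do not see the Bernoulli model very often. According to a paper comparing the classification accuracy of the two (McCallum and Nigam, 1998), the multinomial model is more accurate when the vocabulary is large. The Bernoulli model also takes into account the probability of the words that do not appear in the document, so its computational cost is higher.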
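To make the difference concrete, here is a minimal sketch of a Bernoulli-style score on the same toy data. It is not the implementation from this article: the smoothing (document frequency + 1) / (documents in category + 2) is my recollection of how IIR chapter 13 presents the Bernoulli model and should be treated as an assumption, and all names are invented for the example.

```python
import math

# Same toy training data as the multinomial example in this article
data = [["yes", "Chinese", "Beijing", "Chinese"],
        ["yes", "Chinese", "Chinese", "Shanghai"],
        ["yes", "Chinese", "Macao"],
        ["no", "Tokyo", "Japan", "Chinese"]]

vocab = {w for d in data for w in d[1:]}
cats = {d[0] for d in data}
n_docs = {c: sum(1 for d in data if d[0] == c) for c in cats}
# doc_freq[c][w]: number of documents in category c that contain w at least once
doc_freq = {c: {w: sum(1 for d in data if d[0] == c and w in d[1:]) for w in vocab}
            for c in cats}

def bernoulli_score(doc, cat):
    """log P(cat) plus, for EVERY vocabulary word, log P(present or absent | cat)."""
    present = set(doc)
    score = math.log(n_docs[cat] / len(data))
    for w in vocab:
        p = (doc_freq[cat][w] + 1) / (n_docs[cat] + 2)   # assumed add-one smoothing
        score += math.log(p if w in present else 1.0 - p)
    return score

test = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
best = max(cats, key=lambda c: bernoulli_score(test, c))
print(best)
```

The loop runs over the whole vocabulary rather than over the words of the document, which is where the extra cost comes from. On this toy example the sketch picks no, unlike the multinomial model, because it ignores how many times Chinese occurs and penalizes the absence of Beijing, Shanghai, and Macao.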
This time I implemented naive Bayes, the most basic of the text classification algorithms. The example I used was awfully simple and not much fun, so next I would like to try classifying spam e-mail or the article categories of this blog (there is a category menu on the left).
References
- Introduction to Information Retrieval (known as IIR), Chapter 13 (PDF) - The full text is available on the Web
- F. Sebastiani: Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34(1), 2002. - A comprehensive survey of text classification, though a little dated.
- A. McCallum and K. Nigam: A Comparison of Event Models for Naive Bayes Text Classification (PDF), AAAI-98 Workshop on Learning for Text Categorization, 1998. - The well-known paper comparing the multinomial and Bernoulli models of naive Bayes.
- ベイジアン (bayesian)、ベイズ (bayes)、ナイーブベイズ (naive bayes) ってなんですか? ("What are Bayesian, Bayes, and naive Bayes?") - An explanation from POPFile, well known as a spam filter that uses naive Bayes
- ナイーブベイズによるテキスト分類体験アプリ ("A hands-on app for text classification with naive Bayes")
- 新はてなブックマークでも使われてるComplement Naive Bayesを解説するよ ("Explaining Complement Naive Bayes, which is also used in the new Hatena Bookmark")
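Supplement
Comparing the log scores is enough to produce a classification result. If, beyond the classification result, you also want the posterior probability P(cat|doc) of each category for the test data, you would in principle compute it from

$$P(cat \mid doc) = \frac{P(cat)\,P(doc \mid cat)}{P(doc)}$$

but obtaining p(doc), the denominator of the normalizing coefficient (the coefficient that makes the probabilities sum to 1), is rather laborious. There is a well-known trick for this, shown below.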
    # Method to add to the NaiveBayes class
    def postProb(self, doc, cat):
        """Return the unnormalized (not divided by p(doc)) posterior P'(cat|doc)."""
        total = sum(self.catcount.values())      # total number of documents
        pp = float(self.catcount[cat]) / total   # prior probability P(cat)
        # Likelihood P(doc|cat) = P(word1|cat) * P(word2|cat) * ...
        # No logarithm is taken, so this is a product (a very small value!)
        for word in doc:
            pp *= self.wordProb(word, cat)
        return pp

    # Posterior probability of each category for the test data
    test = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
    p1 = nb.postProb(test, "yes")   # not normalized, so not yet a probability!
    p2 = nb.postProb(test, "no")    # not normalized, so not yet a probability!
    # Dividing by the sum gives probabilities that add up to 1
    print("P(yes|test) =", p1 / (p1 + p2))
    print("P(no|test) =", p2 / (p1 + p2))
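The result (to 12 significant digits) is

    P(yes|test) = 0.689758611763
    P(no|test) = 0.310241388237

so the probability that the test data is yes is 69% and the probability that it is no is 31%, and the two add up to 1 as probabilities should.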
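Of course, expanding the denominator as p(doc) = p(cat1)p(doc|cat1) + p(cat2)p(doc|cat2) + ... and computing exactly as the formula says gives the same result. The trick above is in fact the same calculation, since p1 + p2 is precisely that sum.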
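For long documents the unnormalized products themselves can underflow, so it may be preferable to normalize directly from the log scores. The sketch below is one common way to do that (subtracting the maximum before exponentiating); `normalize_log_scores` is a name made up here, and the input values are the two log scores printed by the program earlier in the article.

```python
import math

def normalize_log_scores(log_scores):
    """Turn {cat: log(P(cat) P(doc|cat))} into {cat: P(cat|doc)}."""
    m = max(log_scores.values())
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow
    # and the largest term is exactly 1.0.
    exps = {cat: math.exp(s - m) for cat, s in log_scores.items()}
    z = sum(exps.values())
    return {cat: e / z for cat, e in exps.items()}

# Log scores taken from the program output in this article
posterior = normalize_log_scores({"yes": -8.10769031284, "no": -8.906681345})
print(posterior["yes"], posterior["no"])
```

Up to rounding of the printed log scores, this reproduces the 69% / 31% split obtained with the product-based trick.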