Introduction to Information Retrieval #12: Review Materials
I have uploaded my review materials for Chapter 12 of the Introduction to Information Retrieval reading group below.
The topic of Chapter 12 is "Language models for information retrieval": applying probabilistic language models to information retrieval.
Probabilistic language models
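A probabilistic language model is a mathematical model of natural language that assigns probabilities to word sequences or strings. For example, suppose we have the word sequence s = "frog said that toad likes that dog" and are given an occurrence probability for each word under two models, M1 and M2.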
Model | frog | said | that | toad | likes | that | dog |
---|---|---|---|---|---|---|---|
M1 | 0.01 | 0.03 | 0.04 | 0.01 | 0.02 | 0.04 | 0.005 |
M2 | 0.0002 | 0.03 | 0.04 | 0.0001 | 0.04 | 0.04 | 0.01 |
Multiplying the per-word probabilities gives
- P(s|M1) = 0.01 x 0.03 x 0.04 x 0.01 x 0.02 x 0.04 x 0.005 = 0.48 x 10^-12
- P(s|M2) = 0.0002 x 0.03 x 0.04 x 0.0001 x 0.04 x 0.04 x 0.01 = 0.384 x 10^-15
so P(s|M) can be computed from each model's probability distribution. My understanding is that a probabilistic language model is one that captures natural language on the basis of word occurrence probabilities in this way.
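To check the arithmetic, here is a minimal Python sketch of the same calculation. The function name `sequence_probability` and the dictionary layout are my own choices for illustration. The per-word values are the ones from the table, with P(frog|M2) taken as 0.0002, which is the value that makes the product match the stated 0.384 x 10^-15.

```python
import math

# Per-word probabilities for the words of s under each model.
# Assumption: P(frog | M2) = 0.0002, the value consistent with the stated product.
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001, "likes": 0.04, "dog": 0.01}


def sequence_probability(words, model):
    """Unigram probability of a word sequence: the product of per-word probabilities."""
    return math.prod(model[w] for w in words)


s = "frog said that toad likes that dog".split()
p_m1 = sequence_probability(s, M1)
p_m2 = sequence_probability(s, M2)
print(f"P(s|M1) = {p_m1:.3e}")
print(f"P(s|M2) = {p_m2:.3e}")
```

Under these numbers M1 assigns s a higher probability than M2, so M1 is the more likely of the two models to have generated s.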
Chapter 12: Language models for IR
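The retrieval models covered up to Chapter 11 were the Boolean model, the vector space model, and the probabilistic model. These models are built on the idea of "predicting the information need from the search query and finding the relevant documents." Retrieval with a probabilistic language model (hereafter simply "language model") works differently: a language model Md is estimated from each document d, and ranking is based on the probability P(q|Md) that this language model generates the query.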
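Information retrieval can also be viewed as the problem of maximizing the conditional probability P(d|q): for a query q, find the d for which P(d|q) is largest. By Bayes' theorem, P(d|q) can be written as
- P(d|q) = P(q|d) x P(d) / P(q)
Here P(q) is the same for every document, so it can be ignored. P(d) is the relevance of a document independent of the query; in this chapter it is also treated as a constant (the same for every document) and ignored. In the end, finding argmax P(d|q) reduces to finding argmax P(q|d). In other words, argmax P(q|d) asks "how likely is document d to be retrieved by query q?" A model that approaches information retrieval around this P(q|d) is called the "query likelihood model."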
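Written out step by step, the reduction looks like the following. The symbol \hat{d} for the top-ranked document is my own notation, and the last step relies on the uniform-prior assumption for P(d) stated above.

```latex
\begin{align*}
\hat{d} &= \operatorname*{arg\,max}_{d} P(d \mid q) \\
        &= \operatorname*{arg\,max}_{d} \frac{P(q \mid d)\, P(d)}{P(q)} \\
        &= \operatorname*{arg\,max}_{d} P(q \mid d)\, P(d)
           && \text{($P(q)$ does not depend on $d$)} \\
        &= \operatorname*{arg\,max}_{d} P(q \mid d)
           && \text{(assuming $P(d)$ is the same for every $d$)}
\end{align*}
```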
With a language model, we can regard P(q|d) ≈ P(q|Md). So once a language model Md has been estimated for each document, the value of P(q|Md) for each document is determined in the same way we computed the probability of the string "frog said ..." earlier. Comparing these values makes scoring possible.
Estimating the language model
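To estimate a language model, we first decide which kind of language model to assume. Representative language models include
- N-gram models
- Hidden Markov models
- Probabilistic grammars (probabilistic context-free grammar)
and others. IIR adopts the N-gram model, on the grounds that IR systems (compared with, for example, speech recognition) do not need to take contextual structure into account that much, so an N-gram model is sufficient. Furthermore, since bag of words has proved effective in the chapters so far, the chapter works with unigrams.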
There are also several methods for estimating a language model. Here we use maximum likelihood estimation (MLE), which is based on the maximum likelihood principle of maximizing the probability of generating the training data.
Assuming unigrams and using MLE, the language model turns out to be estimated by the relative frequency of each word in the document. However, if relative frequencies are used as they are, a word that does not appear in the training data makes the probability of the whole sequence 0. This is the "zero-frequency problem." The remedy is to smooth the probability distribution, and many smoothing methods appear to have been proposed. In addition to the additive smoothing (Laplace's method) already seen in Chapter 11, IIR introduces linear interpolation, Bayesian smoothing, and others. Chapter 12 uses linear interpolation.
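The zero-frequency problem is easy to reproduce. The sketch below estimates an unsmoothed unigram model by relative frequency and then scores two queries. The function names `mle_unigram` and `query_likelihood`, the toy document, and the queries are all made up for illustration.

```python
from collections import Counter


def mle_unigram(tokens):
    """Maximum likelihood unigram model: relative frequency of each word."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def query_likelihood(query_tokens, model):
    """P(q|Md) under an unsmoothed unigram model; unseen words get probability 0."""
    p = 1.0
    for word in query_tokens:
        p *= model.get(word, 0.0)
    return p


# Hypothetical toy document, used only as training data for the model.
doc = "frog said that toad likes that dog".split()
model = mle_unigram(doc)

p_seen = query_likelihood(["toad", "dog"], model)    # both words occur in doc
p_unseen = query_likelihood(["toad", "cat"], model)  # "cat" never occurs in doc
print(p_seen, p_unseen)
```

A single unseen query word ("cat") drives the whole product to 0, even though "toad" matches. Smoothing is meant to prevent exactly this.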
Assuming unigrams, estimating the language model with MLE, and smoothing with linear interpolation yields a formula for P(d|q). (Slide 24) Applying this formula to an actual set of documents makes scoring based on P(d|q) possible. (Slide 25)
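As a rough sketch of that scoring, the code below mixes the document model with a collection-wide model, using lam * P(t|Md) + (1 - lam) * P(t|Mc) for each query term and multiplying over the query. The function name `smoothed_query_likelihood`, the two toy documents, the query, and lam = 1/2 are assumptions chosen for illustration, and the document prior P(d) is taken to be uniform so that it can be dropped. `Fraction` is used only to keep the numbers exact.

```python
from collections import Counter
from fractions import Fraction


def smoothed_query_likelihood(query_tokens, doc_tokens, collection_tokens, lam):
    """P(q|d) with linear interpolation between document and collection models."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    score = Fraction(1)
    for term in query_tokens:
        p_doc = Fraction(doc_counts[term], len(doc_tokens))            # MLE P(t|Md)
        p_coll = Fraction(coll_counts[term], len(collection_tokens))   # MLE P(t|Mc)
        score *= lam * p_doc + (1 - lam) * p_coll
    return score


# Toy collection and mixing weight, chosen for illustration only.
docs = {
    "d1": "xyzzy reports a profit but revenue is down".split(),
    "d2": "quorus narrows quarter loss but revenue decreases further".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
query = "revenue down".split()
lam = Fraction(1, 2)

scores = {name: smoothed_query_likelihood(query, tokens, collection, lam)
          for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Here d2 does not contain "down" at all, yet it still gets a nonzero score because the collection model contributes a small probability for that term. d1 contains both query words and ranks first.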
While reading this chapter I found 言語と計算 (4) 確率的言語モデル (Language and Computation 4: Probabilistic Language Models) helpful. It was a little advanced for me, but reading it alongside IIR Chapter 12 deepened my understanding.
言語と計算 (4) 確率的言語モデル
- Authors: Kenji Kita, Jun'ichi Tsujii
- Publisher: University of Tokyo Press
- Release date: 1999/11
- Format: Book (tankobon)
I also referred to the articles at the following URLs.
Next session and other notes
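Today's reading group covered Chapter 13, which shifts topic a little: "Text classification and Naive Bayes." Chapter 14 is Vector space classification and Chapter 15 is SVM, so for a while the theme will be text classification with machine learning. The material is gradually becoming mostly theory, and I feel the gap with practice has been widening. I would like to write some simple implementations and verify things as we go.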
The next reading group is scheduled for 10/18 (Sat). After that session I will upload review materials as usual.
The review materials (ppt) for past chapters can be browsed from the directory at the same URL (http://bloghackers.net/~naoya/iir/ppt/).
Introduction to Information Retrieval
- Authors: Christopher D. Manning, Prabhakar Raghavan, Hinrich Schuetze
- Publisher: Cambridge University Press
- Release date: 2008/07/07
- Format: Hardcover