Thoughts on Information Retrieval, Search Engines, Data Mining, Science, Engineering, and Programming
source: http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf

There is some buzz about Probabilistic Latent Semantic Indexing, hence this post.

From VSM to LSI

Prior to 1988 the prevalent IR model was Salton's Vector Space Model (VSM). This model treats documents and queries as vec
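
Inspired by id:naoya's article on Latent Semantic Indexing, I will write about the matrix approximation methods I have been looking at on and off for the past week or so. The matrices of interest here are term-document matrices (co-occurrence matrices recording which word appears in which document), buyer-item matrices (which person bought which book, as used in recommendation engines), and page-link matrices (which page links to which page, or receives links from it; used to compute page rankings such as PageRank). With large matrices of this kind, both the computation and the storage required are enormous, so if part of the work can be done in advance, the goal is to make the matrix as small as possible beforehand (and, if possible, to improve accuracy as well). To compress a matrix, it is common to take the original matrix A (m rows, n columns) and decompose it into three factors, A = USV^T, but...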
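
The excerpt breaks off right after introducing A = USV^T, so what follows is only the textbook rank-k truncation of the singular value decomposition, given for orientation; it is not necessarily the method the post goes on to discuss. The symbols k, U_k, S_k, V_k and sigma_i are notation assumed here, not taken from the excerpt.

```latex
% Rank-k truncation of the SVD of an m x n matrix A (with k <= rank(A)).
A = U S V^{T}, \qquad
A \approx A_k = U_k S_k V_k^{T},
\qquad
U_k \in \mathbb{R}^{m \times k},\;
S_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\;
V_k \in \mathbb{R}^{n \times k},
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k > 0 .

% Numbers to store: the full matrix versus the three truncated factors.
mn \quad \text{versus} \quad mk + k + nk = k\,(m + n + 1).

% Approximation error in the Frobenius norm (Eckart--Young):
\lVert A - A_k \rVert_F = \sqrt{\sum_{i > k} \sigma_i^{2}} .
```

When k is much smaller than both m and n, the three factors take far less space than A itself, which is the sense in which the decomposition "compresses" the matrix.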

In the vector space model of information retrieval, each document is regarded as a vector and handled in a linear space. The components of a document vector are quantities such as the frequencies of the words the document contains. The result is a term-document matrix like the following.

        d1  d2  d3  d4
Apple    3   0   0   0
Linux    0   1   0   1
MacOSX   2   0   0   0
Perl     0   1   0   0
Ruby     0   1   0   3

Retrieval under the vector space model means running computations on this term-document matrix, such as inner-product similarity, to find the documents that match an information need.

As you can see, the dimensionality of the term-document matrix is the total number of index terms, and it tends to grow as more documents are added. For example, with 1,000,000 index terms and 10,000,000 documents to search, you end up handling a matrix of size 1,000,000 dimensions × 10,000,000, but...
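
The inner-product ranking described above can be sketched in a few lines. This is a minimal illustration using the example matrix from the excerpt; the helper names (`document_vector`, `query_vector`, `rank_documents`) and the sample query are assumptions made for this sketch, not part of the original post.

```python
# Term-document matrix from the excerpt: rows are terms, columns are documents.
TERMS = ["Apple", "Linux", "MacOSX", "Perl", "Ruby"]
DOCS = ["d1", "d2", "d3", "d4"]
MATRIX = [
    [3, 0, 0, 0],  # Apple
    [0, 1, 0, 1],  # Linux
    [2, 0, 0, 0],  # MacOSX
    [0, 1, 0, 0],  # Perl
    [0, 1, 0, 3],  # Ruby
]


def document_vector(doc_index):
    """Return one column of the matrix, i.e. the vector for one document."""
    return [row[doc_index] for row in MATRIX]


def query_vector(query_terms):
    """Hypothetical helper: count how often each index term occurs in the query."""
    return [query_terms.count(term) for term in TERMS]


def inner_product(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(query_terms):
    """Score every document by inner product with the query, best first."""
    q = query_vector(query_terms)
    scores = [
        (doc, inner_product(q, document_vector(j)))
        for j, doc in enumerate(DOCS)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Sample query (an assumption for illustration): the terms "Ruby" and "Linux".
ranking = rank_documents(["Ruby", "Linux"])
print(ranking)
```

For this query, d4 scores 4 and d2 scores 2, while d1 and d3 share no terms with the query and score 0. Note that every score requires a pass over all index terms, which is why the matrix size discussed above matters.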

There is something called the "IIR reading group," which very well-known people such as id:naoya and Tatsuo have been running for some time, and this time they apparently read Chapter 18, "Matrix decompositions and latent semantic indexing."
http://d.hatena.ne.jp/naoya/20090208
http://chalow.net/2009-02-08-2.html
Latent Semantic Indexing is commonly known as LSI, or as LSA (Latent Semantic Analysis). In Japanese it goes by the name 「潜在的意味インデキシング」. Put simply: take a huge matrix (say, tens of thousands by tens of thousands) and compress it, squashing it sideways down to something like "hundreds by tens of thousands," and, strangely enough, that small matrix is very meaning...
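
I skipped ahead a little and read IIR Chapter 18 first. It covers LSI: applying singular value decomposition to the term-document matrix and then using the vector space model in the new space. Since it was only a few pages long, on a whim I tried translating it. And since there are many formulas, I wrote it up in TeX. Having come that far, my bad blood-type-AB habits gradually took over and I translated all the formulas and exercises as well. I got carried away and did it. It is now public. But I have no regrets. Some parts are still rough, so I will keep updating it bit by bit.
Introduction to information retrieval: 18 Matrix decomposition and latent semantic indexing (Japanese translation)
It took roughly one hour per page, about three days of on-and-off late nights. Translating forces you to read closely whether you like it or not, so my understanding deepened a great deal. Both the careful reading and the translation work were a lot of fun, so...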
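
To make "decompose the term-document matrix with SVD and use the vector space model in the new space" concrete, here is a minimal sketch using NumPy on the small example matrix quoted earlier on this page. The choice k = 2, the `fold_in` helper, the sample query, and the decision to represent documents as columns of S_k V_k^T are all assumptions for illustration; conventions differ on whether and how to scale by the singular values.

```python
import numpy as np

# Term-document matrix (terms x documents) from the example on this page.
A = np.array([
    [3, 0, 0, 0],  # Apple
    [0, 1, 0, 1],  # Linux
    [2, 0, 0, 0],  # MacOSX
    [0, 1, 0, 0],  # Perl
    [0, 1, 0, 3],  # Ruby
], dtype=float)

k = 2  # assumed number of latent dimensions, chosen only for illustration

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors.
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of A

# Documents in the k-dimensional space: one column per document.
docs_k = np.diag(s_k) @ Vt_k


def fold_in(term_vector):
    """Hypothetical helper: project a term-space vector into the latent space."""
    return U_k.T @ term_vector


def cosine(u, v, eps=1e-12):
    """Cosine similarity, returning 0.0 when either vector is (numerically) zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > eps else 0.0


# Sample query (an assumption): the terms "Linux" and "Ruby".
q = np.array([0, 1, 0, 0, 1], dtype=float)
q_k = fold_in(q)
scores = [cosine(q_k, docs_k[:, j]) for j in range(A.shape[1])]
print(scores)
```

With this projection, `fold_in` applied to a column of A gives the corresponding column of `docs_k`, so queries and documents live in the same space. On a matrix this small the reduction is only a toy: with k = 2, d2 and d4 end up pointing along the same latent direction, which shows both the merging effect LSI relies on and the detail it throws away.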