cles::blog 平常心是道, 2007/07/01: Building feature-word extraction with the Yahoo! API (tags: YahooAPI, nlp). NP_MetaTags has a feature that automatically generates the keywords meta tag by extracting feature words from the body of an article, and it relied on the Bulkfeeds feature-word extraction API to do so. Bulkfeeds has been down for a while now, however, so I looked for another way to extract feature words. Quoted from "Keyword extraction with morphological analysis, a search API, and TF-IDF": Goal: from the text targeted for keyword extraction, extract the keywords that represent that text. The TF-IDF measure is used. (The larger this value, the more the word looks like a representative keyword
Some time ago I wrote k-means++ in Perl, but I had no data to actually try it on, so I left it untouched. Since I would like to try it on a large data set, this time, as preparation, I will extract data that represents the features of each Wikipedia keyword. I then plan to use the data built here to try various other methods, such as k-means and hierarchical clustering. For the features I will simply use plain TFIDF. The pages below explain TFIDF in detail, so please refer to them: "Keyword extraction with morphological analysis, a search API, and TF-IDF" and "tf-idf - Wikipedia". First, download the Wikipedia data. Download "jawiki-latest-pages-articles.xml.bz2" from the following page: http://download.wik
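Break the body text into morphemes and put the required parts of speech into a tf table and a df table. Do this for every document in the set under analysis, compute the TF-IDF value of each morpheme, and turn each document into a vector. Compare inner products with the other document vectors and rank the "similar articles" in ascending order (clustering and the like are handled separately). With TF normalized following Harman and DF normalized following Sparck Jones, the TF-IDF value is computed as follows (reference):
tfidf(i,j) = log2(freq(i,j) + 1) / log2(NoT) * (log2(N / Dfreq(i)) + 1)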
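The formula above can be sketched in Python as follows. This is only an illustration under stated assumptions: the snippet does not define NoT, so it is taken here to mean the number of terms in document j, and the function and parameter names are made up for the example.

```python
import math


def harman_sparck_jones_tfidf(freq_ij, num_terms_j, n_docs, doc_freq_i):
    """TF-IDF with log-normalized TF and IDF, following the quoted formula.

    freq_ij     -- freq(i,j): occurrences of term i in document j
    num_terms_j -- NoT: ASSUMED to be the number of terms in document j
    n_docs      -- N: number of documents in the collection
    doc_freq_i  -- Dfreq(i): number of documents containing term i
    """
    if num_terms_j < 2:
        # log2(1) is 0, so the TF normalization would divide by zero.
        raise ValueError("num_terms_j must be at least 2")
    if doc_freq_i < 1:
        raise ValueError("doc_freq_i must be at least 1")
    tf = math.log2(freq_ij + 1) / math.log2(num_terms_j)
    idf = math.log2(n_docs / doc_freq_i) + 1
    return tf * idf


# A term seen 3 times in a 16-term document, present in 2 of 8 documents:
# tf = log2(4) / log2(16) = 0.5, idf = log2(4) + 1 = 3
print(harman_sparck_jones_tfidf(3, 16, 8, 2))  # 1.5
```

The "+ 1" inside the TF logarithm keeps a term with zero occurrences at weight 0, and the "+ 1" added to the IDF factor keeps a term that appears in every document from being zeroed out entirely.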
In information retrieval there is a commonly used algorithm called "TF/IDF". It is often used for tasks such as extracting "feature words" from a document. For a detailed explanation of the TF/IDF algorithm, see the linked page. I have uploaded to CPAN a Perl module that makes this TF/IDF calculation "easy", so let me introduce it. It is called Lingua::JA::TFIDF. Lingua::JA::TFIDF - TF/IDF calculator based on MeCab. http://search.cpan.org/~miki/Lingua-JA-TFIDF The hard part of implementing TF/IDF: anyone who has tried to implement TF/IDF will know that, when you actually do it, computing TF (Term Frequency) is not difficult at all, but IDF (Inve
While tidying up the source code of my naive Bayes classifier, it occurred to me that, since it already builds a word database, I should be able to extract important words based on TF-IDF. TF-IDF comes from information retrieval, so the calculation assumes that the document from which important words are extracted is already part of the learned document set (probably). That means that for a document that has not been learned, DF can be 0, which makes the calculation impossible (the zero-frequency problem?). I decided to use a technique similar to the additive smoothing I learned about when studying naive Bayes, but I wonder whether that is sound... I also looked into index term weighting (term weighting), which TF-IDF is based on.
・Local weight
・Global weight
・Document normalization fact
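One way to picture the additive-smoothing idea for the DF = 0 case is the sketch below. It is not the author's implementation, which the snippet does not show. The function name, the `alpha` parameter, and the choice to add the same constant to both N and DF are assumptions made for illustration.

```python
import math


def smoothed_idf(n_docs, doc_freq, alpha=1.0):
    """IDF with an additive-smoothing-style correction (illustrative only).

    Adding alpha to both counts keeps the ratio finite when doc_freq is 0,
    which is the zero-frequency case described above.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return math.log((n_docs + alpha) / (doc_freq + alpha))


n = 10
for df in (0, 1, 5, 10):
    print(df, smoothed_idf(n, df))
```

With this variant an unseen term (DF = 0) gets the largest weight rather than an error, and a term present in every document gets weight 0.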
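TF・IDF (read "tee-eff eye-dee-eff"): one method of weighting index terms. TF (Term Frequency) is the frequency of search term t in document d. IDF (Inverse Document Frequency) is the logarithm of the reciprocal of the relative document frequency of the index term. Given the number of documents N and the number of documents DF(t) in which index term t appears at least once, it is defined as:
IDF(t) = log10(N / DF(t))
The index term is weighted by the product of the two. For example, if the same index term appears many times in one document, the TF-IDF value becomes large; if the index term appears in many documents, the value becomes small. Importance according to TF・IDF: a term that can serve as a keyword characterizing a document is thought to have two properties: "it appears many times, that is, with high frequency, in that document (TF)" and "it appears in only a small number of documents (IDF)". Simple, but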
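The definition above (raw term count times log10(N / DF(t))) can be tried on a toy corpus with the following sketch. The function name and the pre-tokenized input format are assumptions for the example; real Japanese text would first need a morphological analyzer to produce the tokens.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. TF is the raw count of a term in a
    document and IDF(t) = log10(N / DF(t)), as in the definition above.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append(
            {t: tf[t] * math.log10(n_docs / doc_freq[t]) for t in tf}
        )
    return weights


docs = [["a", "a", "b"], ["b", "c"]]
for w in tf_idf_weights(docs):
    print(w)
```

In this toy corpus "b" appears in every document, so its weight is 0 in both, while "a" appears twice in a single document and gets the largest weight.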
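I studied tfidf in class and found it a bit hard to follow, so here is a summary. What is tfidf? One of the algorithms used in information retrieval. It assigns a weight to each word, represents the query and the documents in a vector space, and ranks documents by their similarity to the query. The higher the value, the more important the word. tfidf = w = tf・idf, where w is the weight. What is tf? Term frequency. A word that appears many times in the same document is a stronger clue for the search. In other words, you look for words that are written often within a single document. f = frequency of term in a document, that is, how often a word appears in one document: the number of hits you get when you search for the word with Ctrl-F in a browser. tf = f/max(f) = frequency of the word / count of the most frequent word in the text. Correction (2009 1/6): tf = f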
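The max-normalized form tf = f/max(f) quoted above can be sketched as follows. Note that the snippet's 2009 correction is cut off, so this shows only the formula as stated before that correction. The function name and the token-list input are assumptions for the example.

```python
from collections import Counter


def max_normalized_tf(tokens):
    """tf = f / max(f): each count divided by the most frequent term's count."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_f = max(counts.values())
    return {term: f / max_f for term, f in counts.items()}


print(max_normalized_tf(["a", "a", "b", "c"]))
# {'a': 1.0, 'b': 0.5, 'c': 0.5}
```

Dividing by the largest count puts every tf value in the range (0, 1], so long documents do not get larger weights simply because they contain more words.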
In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf) is short for term frequency–inverse document frequency, a statistic (numerical value) intended to reflect how important a word is in a corpus or collection of documents [1]. tf-idf is also often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value of a word increases in proportion to the number of times the word appears in a document and is offset by the number of documents in the corpus that contain the word. This property helps adjust for the fact that some words appear more frequently in general. Today tf-idf is the best-known term-weighting method. A study conducted in 2015
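Keyword extraction with morphological analysis, a search API, and TF-IDF 2005-10-12-1 [Programming][Algorithm] KEYAPI [2005-09-30-3] is the demo from the recent search conference that extracts keywords using a morphological analyzer, the Yahoo! Web search API, and TF-IDF. It is the most basic of basics, the kind of thing found in textbooks, but I would like to go over the essentials again with a simple example. Goal: from the text targeted for keyword extraction, extract the keywords that represent that text. The TF-IDF measure is used. (That is, the larger this value, the more the word looks like a representative keyword.) To compute TF-IDF you need (1) the number of occurrences of the candidate representative keyword in the target text (TF), (2) the total number of documents (N), and (3) the documents that contain the candidate representative keyword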