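Natural Language Analysis in MONMO (Part 2)
This is the second installment of an attempt to run a full natural language processing pipeline on MONMO-chan.
Last time we got as far as morphological analysis.
This time we "vectorize": from the morphological analysis results we compute a "vector" that represents the features of each document.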
monmo-NLProcessing
TF-IDF
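A representative vectorization technique in natural language processing.
The idea
- A word that appears many times in a document is an important word that characterizes that document.
- A word that appears in many documents is a common word and therefore not important.
Simple.
Elements of TF-IDF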
- N
  - total number of documents
- TF[a]
  - number of times a word (a) appears in one given document
- DF[a]
  - number of documents in which the word appears
- IDF[a]
  - log( N / DF[a] )
- TF-IDF[a]
  - TF[a] x IDF[a]
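Written as formulas, the elements above look like this. The document index d is added here for clarity, and the base-10 logarithm is an assumption inferred from the example in this post, where log(2) comes out at about 0.3.

```latex
\mathrm{IDF}[a] = \log_{10}\frac{N}{\mathrm{DF}[a]},
\qquad
\mathrm{TF\text{-}IDF}[a, d] = \mathrm{TF}[a, d] \times \mathrm{IDF}[a]
```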
TF-IDF example
私は○○を読みたい。 (I want to read ○○.)
私は○○を書きたい。 (I want to write ○○.)
"私は○○を読みたい。"={ "私" : 0, "は" : 0, "○○" : 0, "を" : 0, "読み" : 0.3, "たい" : 0 }
"私は○○を書きたい。"={ "私" : 0, "は" : 0, "○○" : 0, "を" : 0, "書き" : 0.3, "たい" : 0 }
- N=2
- The IDF of "私", "は", "○○", "を", and "たい" is log(2/2) = log(1) = 0, so their TF-IDF is also 0.
- The IDF of "読み" and "書き" is log(2/1) = log(2) = about 0.3.
- The TF of "読み" and "書き" is 1 each, so their TF-IDF is 1 x 0.3 = 0.3.
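The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the tokens are romanized by hand, "X" stands in for the noun the two sentences share, and the base-10 logarithm is assumed because log(2) is given as about 0.3.

```python
import math
from collections import Counter

# Token lists for the two example sentences; "X" is a stand-in for the shared noun.
docs = [
    ["watashi", "wa", "X", "wo", "yomi", "tai"],  # "I want to read X."
    ["watashi", "wa", "X", "wo", "kaki", "tai"],  # "I want to write X."
]

n_docs = len(docs)
# DF: number of documents in which each word appears at least once.
df = Counter(word for doc in docs for word in set(doc))
# IDF with a base-10 logarithm (assumption, see the lead-in).
idf = {word: math.log10(n_docs / count) for word, count in df.items()}

for doc in docs:
    tf = Counter(doc)  # TF: occurrences of each word in this document
    tfidf = {word: round(tf[word] * idf[word], 1) for word in tf}
    print(tfidf)
```

Only "yomi" and "kaki" end up with a weight of 0.3; every word shared by both sentences gets 0.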
This is only an example.
In practice, a word with DF=1 cannot link its document to any other document, so it can be ignored.
(It should be pruned aggressively to reduce the amount of processing.)
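As a rough illustration of that pruning, the sketch below drops every word that occurs in only one document. `prune_rare_words` and its `min_df` parameter are hypothetical names for this example, not part of monmo-NLProcessing.

```python
from collections import Counter

def prune_rare_words(docs, min_df=2):
    """Drop words whose document frequency is below min_df (hypothetical helper)."""
    # DF: count each word once per document it appears in.
    df = Counter(word for doc in docs for word in set(doc))
    return [[word for word in doc if df[word] >= min_df] for doc in docs]

docs = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"]]
pruned = prune_rare_words(docs)
print(pruned)  # [['a', 'b'], ['a', 'b'], ['a']]
```

The words "c", "d", and "e" each occur in a single document, so they disappear before any later stage has to process them.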
So, this time as well, we use MONMO-chan's MAP jobs to process a large (??) number of documents and words in parallel.
monmo-NLProcessing/vectorize
Processing order
- Compute TF from the tokenize result
- Compute DF from TF
- Compute IDF from DF
- Compute TF-IDF from TF and IDF
Given MongoDB's characteristics, this order is probably the most efficient.
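The four steps above can be sketched in plain Python, with each stage reading only the output of the stage before it. This is an in-memory illustration under assumed data shapes, not how the MONMO map jobs are implemented; all function names are hypothetical and the logarithm is assumed to be base 10.

```python
import math
from collections import Counter

def compute_tf(tokens_by_doc):
    """Step 1: per-document term counts from the tokenize result."""
    return {doc_id: Counter(tokens) for doc_id, tokens in tokens_by_doc.items()}

def compute_df(tf_by_doc):
    """Step 2: document frequency, derived from the TF output only."""
    df = Counter()
    for tf in tf_by_doc.values():
        df.update(tf.keys())  # each word counts once per document
    return df

def compute_idf(df, n_docs):
    """Step 3: IDF = log10(N / DF), derived from the DF output."""
    return {word: math.log10(n_docs / count) for word, count in df.items()}

def compute_tfidf(tf_by_doc, idf):
    """Step 4: combine TF and IDF; words missing from idf are dropped."""
    return {
        doc_id: {w: c * idf[w] for w, c in tf.items() if w in idf}
        for doc_id, tf in tf_by_doc.items()
    }

tokens_by_doc = {
    "doc1": ["cat", "sat", "cat"],
    "doc2": ["dog", "sat"],
}
tf = compute_tf(tokens_by_doc)
df = compute_df(tf)
idf = compute_idf(df, n_docs=len(tf))
tfidf = compute_tfidf(tf, idf)
print(tfidf)
```

Because step 4 skips words that are absent from the IDF table, any filtering applied while building IDF carries through to the final vectors.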
Preparing to vectorize
This was already completed by the steps in the previous installment.
All that is left is to change directory.
cd monmo-NLProcessing/vectorize
Vectorize (quick version)
./vectorize.sh -s test.token.sampledoc
- Document search
  - Using the TF result enables advanced search.
./fulltext_search.sh -s test.vector.tf.token.sampledoc -w 'はさみや糊など' -V
= META =
{ "dic" : "analysis.dictionary", "doc" : "test.sampledoc", "doc_field" : "body", "docs" : 73, "normalize" : true, "tf" : "test.vector.tf.token.sampledoc", "token" : "test.token.sampledoc", "type" : "TF" }
= DIC =
5212ed32b399b667b8567608 => はさみ
5212ed33b399b667b85685b3 => や
5212ed55b399b667b858df1f => 糊
5212ed32b399b667b85671f3 => など
= QUERY =
{ "value.w" : { "$all" : [ "5212ed32b399b667b8567608", "5212ed33b399b667b85685b3", "5212ed55b399b667b858df1f", "5212ed32b399b667b85671f3" ] } }
= DOCS =
[ ObjectId("51e64d60c507ed1f43d21400") ]
= VERBOSE =
* 51e64d60c507ed1f43d21400 : 折り紙出典:フリー百科事典『ウィキペディア（Wikipedia）』移動:案内、検索この項目では、紙を折る遊びについて記述しています。"折紙"、"折り紙"の他の用
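The QUERY section of this output uses MongoDB's `$all` operator, which matches a document only when its array field contains every listed value. The Python sketch below imitates that matching rule on toy data. It assumes the queried field holds a list of dictionary word ids per document; `match_all`, the words, and the ids are all made up for illustration.

```python
def match_all(docs, word_to_id, query_words):
    """Return ids of documents whose word-id list contains every query word id."""
    wanted = {word_to_id[w] for w in query_words}
    # A document matches when the wanted ids are a subset of its word ids.
    return [doc_id for doc_id, word_ids in docs.items() if wanted <= set(word_ids)]

# Hypothetical miniature dictionary and document collection.
word_to_id = {"apple": "w1", "banana": "w2", "cherry": "w3"}
docs = {
    "docA": ["w1", "w2", "w3"],
    "docB": ["w1", "w3"],
}
print(match_all(docs, word_to_id, ["apple", "banana"]))  # ['docA']
```

Word order in the query does not matter here, only whether every word id is present in the document.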
Vectorize (full procedure)
  1) TF
./tf.sh -s test.token.sampledoc -o test.vector.tf.token.sampledoc
  2) DF
./df.sh -s test.vector.tf.token.sampledoc -o test.vector.df.token.sampledoc
  3) IDF (needs tuning)
./idf.sh -s test.vector.df.token.sampledoc -o test.vector.idf.token.sampledoc
  4) TF-IDF
./tfidf.sh -s test.vector.idf.token.sampledoc -o test.vector.tfidf.token.sampledoc
IDF tuning
Various kinds of tuning happen in this phase.
  1) Check DF and IDF.
./view_df.sh -s test.vector.df.token.sampledoc
./view_df.sh -s test.vector.idf.token.sampledoc
  2) While checking the output above, adjust the limit, threshold, and verb-only values.
- limit
  - maximum value of (DF / total number of documents)
- threshold
  - minimum value of DF
- verb-only
  - extract verbs only
- If too many words are being cut, raise the limit value
./idf.sh --limit 0.3 -s test.vector.df.token.sampledoc -o test.vector.idf.token.sampledoc
- To cut more words, lower the limit value
./idf.sh --limit 0.5 -s test.vector.df.token.sampledoc -o test.vector.idf.token.sampledoc
- Evaluate verbs only (in most cases it is better to loosen the other conditions)
./idf.sh --limit 0.3 --verb-only -s test.vector.df.token.sampledoc -o test.vector.idf.token.sampledoc
  3) Repeat until the results look good
  4) Regenerate TF-IDF
./tfidf.sh -s test.vector.idf.token.sampledoc -o test.vector.tfidf.token.sampledoc
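To make the effect of limit and threshold concrete, here is a small Python sketch of an IDF step that applies both filters. It is not the implementation of idf.sh: the function name, the default values, and the choice to treat both bounds as inclusive are assumptions. verb-only is left out because it would need part-of-speech tags.

```python
import math

def compute_idf(df, n_docs, limit=1.0, threshold=1):
    """IDF for words with DF >= threshold and DF / n_docs <= limit (assumed bounds)."""
    idf = {}
    for word, count in df.items():
        if count < threshold:        # too rare: below the minimum DF
            continue
        if count / n_docs > limit:   # too common: above the maximum DF ratio
            continue
        idf[word] = math.log10(n_docs / count)
    return idf

df = {"the": 9, "origami": 3, "typo": 1}
print(sorted(compute_idf(df, n_docs=10, limit=0.5, threshold=2)))  # ['origami']
print(sorted(compute_idf(df, n_docs=10, limit=0.9, threshold=2)))  # ['origami', 'the']
```

Under these assumptions a higher limit lets more of the very common words through, while a higher threshold removes more of the rare ones.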
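Summary
With this, MONMO-chan can now turn documents into vectors.
A vector is a mathematical parameterization of a document's features, so from here on it can be used in many ways through mathematical techniques.
Next time I plan to cluster the vectors.
That will make it possible to measure how related documents are, group them, and more!!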
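One common way to score how related two document vectors are is cosine similarity. The sketch below is only an illustration of the kind of mathematical technique meant here; nothing in this post says MONMO uses it, and the sparse `{word: weight}` layout and sample weights are assumptions.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {word: weight}."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

doc1 = {"origami": 0.6, "paper": 0.3}
doc2 = {"origami": 0.6, "paper": 0.3}
doc3 = {"mongodb": 0.5}
print(cosine_similarity(doc1, doc2))  # close to 1: same direction
print(cosine_similarity(doc1, doc3))  # 0.0: no shared words
```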
Correction
TF=1 => DF=1 in "a word with DF=1 ..."
Thank you for pointing this out!!