Using MeCab with Python on Windows
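Morphological analysis is used to split Japanese text into words. Tools for Japanese morphological analysis include ChaSen, MeCab, and Yahoo! Morphological Analysis. I briefly summarized how to use MeCab from Python in "Automatic Classification of Blog Posts Using Naive Bayes" (2010/7/3), but since I use MeCab often, I have reorganized that material into an entry of its own. How to use Yahoo! Morphological Analysis is covered in "Yahoo! Morphological Analysis API" (2009/4/15).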
Installing on Windows
MeCab is a high-performance morphological analysis module that can be used from many languages, including Python, Ruby, Perl, and Java. On Mac OS X and Linux it is easy to compile and install, but on Windows it apparently requires installing MinGW or Visual Studio and patching the code, which is quite a hassle. Fortunately, id:fgshun has compiled the Python module and made the binaries available. The installation steps are as follows.
- Install the Windows build, mecab-0.98.exe, downloaded from the official MeCab site (choosing UTF-8 as the dictionary encoding is the safe option).
- Copy libmecab-1.dll, MeCab.py, and _MeCab.pyd, downloaded from id:fgshun's page "Morphological Analysis Engine MeCab 0.98pre3 Unofficial Build", into the package folder (C:\Python26\Lib\site-packages for Python 2.6). The IPA dictionary from that page is not needed because mecab-0.98.exe has already installed one. id:fgshun passes a mecabrc file to Tagger, but when none is specified MeCab seems to read C:\Program Files\MeCab\etc\mecabrc by default, so I think installing the IPA dictionary with mecab-0.98.exe is the easier route.
The sample programs below check that everything works.
Morphological analysis
#coding:utf-8
import MeCab

# -Ochasen selects ChaSen-compatible output
tagger = MeCab.Tagger("-Ochasen")
# parse() returns the whole analysis as a single string
result = tagger.parse("PythonからMeCabの形態素解析機能を使ってみました。")
print result
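Pass the string you want to analyze to parse() and the analysis comes back as text. The string apparently does not need to be a u"" unicode string. Conversely, a unicode string apparently has to be converted to a byte string first. In Python 2, str() only works for ASCII-only text, so encode('utf-8') is the safer choice, as used in the noun-extraction function later in this entry. The example above produces the output below. Getting the result back as text, however, makes later processing such as extracting the information you need tedious. For that case there is a method called parseToNode(), introduced later.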
Python	Python	Python	名詞-固有名詞-組織
から	カラ	から	助詞-格助詞-一般
MeCab	MeCab	MeCab	名詞-固有名詞-組織
の	ノ	の	助詞-連体化
形態素	ケイタイソ	形態素	名詞-一般
解析	カイセキ	解析	名詞-サ変接続
機能	キノウ	機能	名詞-サ変接続
を	ヲ	を	助詞-格助詞-一般
使っ	ツカッ	使う	動詞-自立	五段・ワ行促音便	連用タ接続
て	テ	て	助詞-接続助詞
み	ミ	みる	動詞-非自立	一段	連用形
まし	マシ	ます	助動詞	特殊・マス	連用形
た	タ	た	助動詞	特殊・タ	基本形
。	。	。	記号-句点
EOS
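To illustrate why text output is awkward to post-process, here is a minimal sketch that turns ChaSen-style output into tuples. It runs without MeCab installed because it works on a few lines pasted from the output above, and it is written for Python 3, unlike the Python 2 samples in this entry. The function name parse_chasen is hypothetical, and the code assumes the columns are tab-separated in the order surface, reading, base form, part of speech.

```python
# A few lines of -Ochasen output copied from this entry.
# Assumption: columns are separated by tab characters.
CHASEN_OUTPUT = (
    "Python\tPython\tPython\t名詞-固有名詞-組織\t\t\n"
    "から\tカラ\tから\t助詞-格助詞-一般\t\t\n"
    "使っ\tツカッ\t使う\t動詞-自立\t五段・ワ行促音便\t連用タ接続\n"
    "EOS\n"
)


def parse_chasen(text):
    """Turn ChaSen-style text into (surface, reading, base, pos) tuples.

    Hypothetical helper for illustration; not part of the MeCab API.
    """
    rows = []
    for line in text.splitlines():
        if not line or line == "EOS":  # EOS marks the end of the sentence
            continue
        cols = line.split("\t")
        if len(cols) < 4:  # skip anything that is not a full token line
            continue
        surface, reading, base, pos = cols[:4]
        rows.append((surface, reading, base, pos))
    return rows


rows = parse_chasen(CHASEN_OUTPUT)
for row in rows:
    print(row)
```

Even this small parser has to know the column order and skip the EOS marker, which is the kind of bookkeeping that parseToNode() avoids.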
Word segmentation (wakati-gaki)
Changing -Ochasen to -Owakati returns the text segmented into space-separated words (wakati-gaki).
#coding:utf-8
import MeCab

tagger = MeCab.Tagger("-Owakati")
result = tagger.parse("PythonからMeCabの形態素解析機能を使ってみました。")
print result
Python から MeCab の 形態素 解析 機能 を 使っ て み まし た 。
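Because the words are separated by spaces, the segmented text can be turned into a word list with a plain split(). The sketch below uses the output line above pasted in as a string, so it runs without MeCab. It is written for Python 3, and the variable names are only illustrative.

```python
# -Owakati output copied from this entry (MeCab is not required to run this).
wakati = "Python から MeCab の 形態素 解析 機能 を 使っ て み まし た 。\n"

# split() with no argument splits on any whitespace and drops the newline.
words = wakati.split()

print(len(words))
print(words[:3])
```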
Readings
Changing -Ochasen to -Oyomi returns the reading of the text. I can't quite see what it would be used for; speech recognition or speech synthesis, perhaps?
#coding:utf-8
import MeCab

tagger = MeCab.Tagger("-Oyomi")
result = tagger.parse("PythonからMeCabの形態素解析機能を使ってみました。")
print result
PythonカラMeCabノケイタイソカイセキキノウヲツカッテミマシタ。
Getting detailed information
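Using parseToNode() instead of parse() gives detailed information about each morpheme. parseToNode() returns the first node (the information for one morpheme). On each node, surface gives the surface form and feature gives the morphological information. Both are strings. feature is comma-separated, so split it with split() or similar and extract the fields you need.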
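#coding:utf-8
import MeCab

tagger = MeCab.Tagger("-Ochasen")
# parseToNode() returns the first node of a linked list of morphemes
node = tagger.parseToNode("PythonからMeCabの形態素解析機能を使ってみました。")
while node:
    print "%s %s" % (node.surface, node.feature)
    node = node.next  # move on to the next morpheme; the loop ends after the last one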
The output is:
 BOS/EOS,*,*,*,*,*,*,*,*
Python 名詞,固有名詞,組織,*,*,*,*
から 助詞,格助詞,一般,*,*,*,から,カラ,カラ
MeCab 名詞,固有名詞,組織,*,*,*,*
の 助詞,連体化,*,*,*,*,の,ノ,ノ
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
機能 名詞,サ変接続,*,*,*,*,機能,キノウ,キノー
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
使っ 動詞,自立,*,*,五段・ワ行促音便,連用タ接続,使う,ツカッ,ツカッ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
み 動詞,非自立,*,*,一段,連用形,みる,ミ,ミ
まし 助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。 記号,句点,*,*,*,*,。,。,。
 BOS/EOS,*,*,*,*,*,*,*,*
The BOS/EOS nodes at the beginning and end are apparently output as well. They can simply be ignored. The format of feature is:
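part of speech (品詞), POS subcategory 1 (品詞細分類1), POS subcategory 2 (品詞細分類2), POS subcategory 3 (品詞細分類3), conjugation form (活用形), conjugation type (活用型), base form (原形), reading (読み), pronunciation (発音)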
That is reportedly the layout. It even includes the pronunciation; I wonder whether it could be used for speech synthesis as well?
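As a small sketch of the split() step, the helper below picks the part of speech, base form, and reading out of a feature string by position. The name parse_feature is hypothetical, and the feature strings are copied from the output above so the code runs without MeCab (Python 3 syntax). Entries such as Python have only seven fields in that output, so the sketch pads short lists with "*", the same placeholder that appears in the output for empty fields.

```python
def parse_feature(feature):
    """Pick POS, base form and reading out of a comma-separated feature string.

    Hypothetical helper for illustration; not part of the MeCab API.
    """
    fields = feature.split(",")
    # Some entries (e.g. "Python" above) carry only 7 fields, so pad to 9.
    fields += ["*"] * (9 - len(fields))
    return {"pos": fields[0], "base": fields[6], "reading": fields[7]}


# Feature strings copied from the parseToNode() output in this entry.
info = parse_feature("動詞,自立,*,*,五段・ワ行促音便,連用タ接続,使う,ツカッ,ツカッ")
print(info["pos"], info["base"], info["reading"])

short = parse_feature("名詞,固有名詞,組織,*,*,*,*")
print(short["pos"], short["reading"])
```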
Extracting only nouns from text
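In tasks such as document classification, a document is often represented as a set of words (the bag-of-words model). The words used for this are mainly nouns. Particles and auxiliary verbs do not convey the content of the text, so they are normally removed. Verbs, adjectives, and adjectival nouns (keiyō-dōshi) are used for some purposes, but I think they are usually left out. So let's write a function that takes a piece of text and returns a list of its nouns.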
#coding:utf-8
import MeCab

def extractKeyword(text):
    """Run morphological analysis on text and return a list of its nouns only."""
    tagger = MeCab.Tagger('-Ochasen')
    # MeCab expects a byte string, so encode the unicode input as UTF-8
    node = tagger.parseToNode(text.encode('utf-8'))
    keywords = []
    while node:
        # feature is a UTF-8 byte string, so compare it with a byte-string
        # literal; comparing with u"名詞" would never match in Python 2
        if node.feature.split(",")[0] == "名詞":
            keywords.append(node.surface)
        node = node.next
    return keywords

if __name__ == "__main__":
    keywords = extractKeyword(u"PythonからMeCabの形態素解析機能を使ってみました。")
    for w in keywords:
        print w,
    print
    keywords = extractKeyword(u"菅直人首相は閣僚の出方や世論を見極めつつ判断する考えだ。")
    for w in keywords:
        print w,
The output is:
Python MeCab 形態素 解析 機能
菅 直人 首相 閣僚 出方 世論 判断 考え
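To connect this back to the bag-of-words model, the sketch below counts the extracted nouns with collections.Counter. The noun lists are the ones printed above, pasted in by hand so the code runs without MeCab (Python 3 syntax). The function name bag_of_words and the repeated word added to the first list are only for illustration.

```python
from collections import Counter

# Noun lists as printed by extractKeyword() for the two example sentences.
doc1 = ["Python", "MeCab", "形態素", "解析", "機能"]
doc2 = ["菅", "直人", "首相", "閣僚", "出方", "世論", "判断", "考え"]


def bag_of_words(keywords):
    """Count how often each word occurs; word order is discarded."""
    return Counter(keywords)


# Add one repeated word so that a count above 1 is visible.
bow1 = bag_of_words(doc1 + ["解析"])
bow2 = bag_of_words(doc2)

print(bow1["解析"])   # appears twice in the extended list
print(bow1["首相"])   # a Counter returns 0 for words it has not seen
print(bow2["首相"])
```

A classifier can then treat each Counter as a sparse word-frequency vector for its document.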
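The keywords extracted by morphological analysis are quite fine-grained. Often you want to extract compound words at a coarser granularity instead. In the examples above, for instance, I would like 形態素解析 ("morphological analysis") and 菅直人 (the name Naoto Kan) to each come out as a single keyword. I will try an approach to this problem that uses a Wikipedia corpus in a later entry.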
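As a rough point of comparison, and not the Wikipedia-corpus approach mentioned above, one naive heuristic is to join each run of adjacent nouns into one keyword. The sketch below is written under that assumption. The function name merge_noun_runs is hypothetical, and the (surface, part of speech) pairs are copied by hand from the parseToNode() output earlier in this entry, so it runs without MeCab (Python 3 syntax).

```python
def merge_noun_runs(tokens):
    """Join each run of adjacent nouns into one compound keyword.

    tokens: list of (surface, pos) pairs in sentence order.
    Hypothetical helper for illustration only.
    """
    compounds, run = [], []
    for surface, pos in tokens:
        if pos == "名詞":
            run.append(surface)       # extend the current run of nouns
        elif run:
            compounds.append("".join(run))  # a non-noun ends the run
            run = []
    if run:                            # flush a run at the end of the sentence
        compounds.append("".join(run))
    return compounds


# (surface, part of speech) pairs taken from the output earlier in this entry.
tokens = [
    ("Python", "名詞"), ("から", "助詞"), ("MeCab", "名詞"), ("の", "助詞"),
    ("形態素", "名詞"), ("解析", "名詞"), ("機能", "名詞"), ("を", "助詞"),
    ("使っ", "動詞"), ("て", "助詞"), ("み", "動詞"), ("まし", "助動詞"),
    ("た", "助動詞"), ("。", "記号"),
]
merged = merge_noun_runs(tokens)
print(merged)
```

On this sentence the heuristic yields 形態素解析機能 rather than 形態素解析, so it tends to over-merge; that limitation is one reason a corpus-based approach is worth trying.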