
A framework for building MeCab user dictionaries from Wikipedia data and emoticon (kaomoji) dictionaries


Out of the blue: is anybody out there using MeCab's dictionary (mecab-ipadic) with its defaults and then grumbling that MeCab is surprisingly useless?

mecab-ipadic is built on relatively well-behaved Japanese, so out of the box it sometimes handles colloquial web text poorly. The proper fix is to prepare training data and retrain, but simply enriching the nouns already improves practical usefulness quite a bit.

Human language has the property that new verb stems and nouns appear every day, while particles and conjugation rules do not change easily. In particular, for tallies such as a "most tweeted words right now" ranking, you can often get decent results as long as you cut out noun boundaries correctly.

Adding words to the dictionary is easy enough, as documented elsewhere, but many people seem to stumble at the point of deciding each word's occurrence cost (word cost).

So we decided to publish the framework we have long used in-house for beefing up MeCab dictionaries. It can build user dictionaries from Wikipedia data, emoticon dictionaries, and similar sources.

mecab-dic-overdrive

https://github.com/nabokov/mecab-dic-overdrive

By writing a subclass of GenDic.pm, you can read words from input data in many formats, have a (reasonably) appropriate word cost estimated automatically, and get a user dictionary file generated for you. By default it can build user dictionaries from the Japanese Wikipedia's jawiki-latest-page.sql.gz and from a TSV file for emoticon dictionaries.

Several similar scripts and articles are already out there, so I did not think there was much point in publishing this. However, as described below, the way it computes word costs and the fact that dictionary management includes normalization seemed original enough to be worth sharing. I hope it is of some use.

mecab-dic-overdrive features

Converting the dictionary to UTF-8

You do not strictly need to convert ipadic itself to UTF-8 in order to use MeCab. But UTF-8 is more convenient when writing the dictionary patches described next and when referring to the dictionary from other programs, so the character encoding is converted first.

Applying patches to the dictionary

misc/dic/*.patch contains several patches against ipadic. They include changes that make single alphanumerics such as "A" and "B" less likely to be cut out on their own, and patches that make "ゎ", "ょ" and the like be recognized as particles. If you want to make further changes of your own, write a *.patch file (in UTF-8) and put it there; it will be applied automatically.

Normalizing the dictionary

To get the most out of a dictionary, you need to absorb notational variants using a variety of techniques. Another important point is to apply the same normalization both when building the dictionary and when analyzing text.

By default, the normalization steps below are applied in exactly this order. Everything except NFKC and lc is a heap of ad hoc workarounds ("bad know-how"). Settings such as newline handling are harmless at dictionary-build time, but some of these settings strongly affect text containing emoticons or symbols, so always configure the same normalization that you use at analysis time.

  1. decode_entities : decode HTML entities into Unicode characters [ an HTML entity for ♥ → ♥ ]
  2. strip_single_nl : remove single newlines (two or more consecutive newlines are treated as a separator)
  3. wavetilde2long : replace the wave dash with the long vowel mark [ プ〜 → プー ]
  4. fullminus2long : replace the fullwidth minus sign with the long vowel mark [ プ－ → プー ]
  5. dashes2long : replace dashes in general with the long vowel mark [ プ— → プー ]
  6. drawing_lines2long : replace horizontal box-drawing lines and the like with the long vowel mark (see: [1] [2]) [ プ─ → プー ]
  7. unify_long_repeats : collapse consecutive long vowel marks into one [ プーーー → プー ]
  8. nfkc : NFKC normalization [ フ゜ﾌﾞ → ププ ]
  9. lc : lowercase alphabetic characters [ ABC → abc ]

To change this, read lib/MecabTrainer/NormalizeText.pm and then edit etc/config.pl. You can also use bin/normalize_text.pl to check the result of normalization.

>bin/normalize_text.pl
ｷﾀ━━━━━━(ﾟ∀ﾟ)━━━━━━ !!!!!
キター(゚∀゚)ー !!!!!

>bin/normalize_text.pl --normalize_opts=decode_entities,nfkc
㍖ ½
レントゲン 1⁄2
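
As a rough illustration of the kind of chain listed above, here is a minimal sketch of a few of the steps using only core Perl modules. It is not the code in lib/MecabTrainer/NormalizeText.pm; the function name normalize_sketch and the exact character classes are assumptions made for this example.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC);

# Hypothetical helper: applies a subset of the steps, in the order given above.
sub normalize_sketch {
    my ($text) = @_;
    $text =~ s/[\x{301C}\x{FF5E}]/\x{30FC}/g;   # wave dash / fullwidth tilde -> long vowel mark
    $text =~ s/\x{FF0D}/\x{30FC}/g;             # fullwidth minus -> long vowel mark
    $text =~ s/[\x{2500}\x{2501}]/\x{30FC}/g;   # box-drawing horizontals -> long vowel mark
    $text =~ s/\x{30FC}{2,}/\x{30FC}/g;         # collapse repeated long vowel marks
    $text = NFKC($text);                        # Unicode NFKC normalization
    return lc $text;                            # lowercase
}

binmode STDOUT, ':encoding(UTF-8)';
print normalize_sketch("ｷﾀ━━━━━━(ﾟ∀ﾟ)━━━━━━ !!!!!"), "\n";
```

The order matters: the dash-like characters are mapped to the long vowel mark before repeats are collapsed, and NFKC runs afterwards so that halfwidth katakana are widened last.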

Automatic assignment of word costs

When registering a new word, the problem is computing the word cost mentioned above. Let us take the product name "鼻セレブ" (Hana Celeb) as an example and think about how to tune its word cost.

A tower of 鼻セレブ
An example of someone who buys nothing but 鼻セレブ (rabbit edition)

Method 1: find the threshold at which a word appearing on its own is just barely not split

With the stock dictionary, when MeCab analyzes a sentence consisting only of "鼻セレブ", it recognizes 「鼻」 and 「セレブ」 as separate words, as shown below.

Morpheme                 Connection cost   Word cost   Cumulative cost
BOS                      -                 0           0
                         -283              -           -283
鼻 (noun/general)        -                 6033        5750
                         62                -           5812
セレブ (noun/general)    -                 9461        15273
                         -573              -           14700
EOS                      -                 0           14700

(BOS marks the beginning of the sentence and EOS marks its end.)

So let us set the following goal: when the word 「鼻セレブ」 appears as a sentence on its own, it should not be split any further.

First, add the morpheme 「鼻セレブ (proper noun/general)」 to the dictionary. Then tune its word cost so that MeCab decides that "treating 鼻セレブ as a single unit has a lower total cost than splitting it into 鼻 + セレブ".

In other words:

Morpheme                          Connection cost   Word cost   Cumulative cost
BOS                               -                 0           0
                                  -310              -           -310
鼻セレブ (proper noun/general)    -                 *1          *
                                  -919              -           *
EOS                               -                 0           *2

The task is to fill in the blank: what value of *1 makes *2 come out below 14700? Here the total is -310 + *1 - 919, so setting *1 to 15928 or less makes the overall cost lower than the 14700 of "鼻 + セレブ".
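
The fill-in-the-blank arithmetic can be written out as a small sketch. The helper name max_word_cost is made up for this example; the three numbers come from the two tables above.

```perl
use strict;
use warnings;

# Largest word cost that still makes the single-morpheme path strictly
# cheaper than the competing split path.
sub max_word_cost {
    my ($bos_conn, $eos_conn, $split_total) = @_;
    return $split_total - 1 - $bos_conn - $eos_conn;
}

# BOS -> word = -310, word -> EOS = -919, total cost of the split path = 14700.
my $limit = max_word_cost(-310, -919, 14700);
my $total = -310 + $limit - 919;

print "word cost limit: $limit\n";   # 15928
print "resulting total: $total\n";   # 14699
```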


Note 1: When other morphemes are attached before and after, as in 「明日の鼻セレブ祭りは中止です」 ("Tomorrow's Hana Celeb festival is cancelled"), the connection costs on either side change. The rule "do not split the word when it appears as a sentence on its own (between BOS and EOS)" is nothing more than an arbitrary criterion.

Note 2: Every now and then I see articles that follow the published Auto Link example and use the formula cost = (int)max(-36000, -400 * (length^1.5)) as is. That formula was written on the assumption that MeCab is used exclusively for Auto Link with that dictionary alone, and I think the baseline no longer matches once you mix it with ipadic. Word costs in ipadic are positive numbers of up to about four digits, whereas this formula yields negative costs, so the user dictionary entry will win almost always, regardless of context. (Of course, if that is what you intend, it is fine.)
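
To see why that formula always favors the user dictionary, here is a minimal sketch that evaluates it for a few lengths. The function name autolink_cost is an assumption for this example; the formula itself is the one quoted in Note 2.

```perl
use strict;
use warnings;
use List::Util qw(max);

# cost = (int)max(-36000, -400 * (length^1.5)), as quoted above
sub autolink_cost {
    my ($length) = @_;
    return int( max(-36000, -400 * $length ** 1.5) );
}

printf "%2d chars -> %6d\n", $_, autolink_cost($_) for 1, 4, 10, 25;
```

Every value is negative, and long words are clamped at -36000, while the ipadic costs shown in the tables above are positive values in the thousands.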

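Method 2: precompute, from the existing dictionary, the average cost of morphemes with the same part of speech and the same length

Separately from the above, there is a somewhat simpler way to get a ballpark word cost.

For example, extract only the "proper noun/general" words from the existing ipadic and take the average word cost for each word length.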

Characters   Average cost
1            8998
2            8242
3            8339
4            7989
5            6947
...          ...
10           5038
...          ...

Build this table in advance, and when registering a new word, apply the average value of existing morphemes with the same part of speech and the same length. "鼻セレブ" is four characters long, so 7989 would be used as its word cost. It is crude, but it should be a good deal better than doing nothing.
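
A minimal sketch of how such a table could be computed follows. It assumes ipadic-style CSV rows of the form surface,left-id,right-id,cost,pos1,pos2,pos3,...; the helper name average_cost_by_length and the three sample rows (including their IDs and costs) are invented for illustration and are not real ipadic data.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: average word cost per surface length for one part of speech.
sub average_cost_by_length {
    my ($rows, @pos) = @_;
    my (%sum, %count);
    for my $row (@$rows) {
        my ($surface, undef, undef, $cost, @features) = split /,/, $row;
        # keep only rows whose leading POS fields match the requested ones
        next unless join(',', @features[0 .. $#pos]) eq join(',', @pos);
        my $len = length $surface;          # length in characters (use utf8 is on)
        $sum{$len}   += $cost;
        $count{$len} += 1;
    }
    return { map { $_ => int($sum{$_} / $count{$_}) } keys %sum };
}

# Made-up rows for illustration only.
my @rows = (
    '東京,1293,1293,3000,名詞,固有名詞,一般,*,*,*,東京,トウキョウ,トーキョー',
    '京都,1293,1293,5000,名詞,固有名詞,一般,*,*,*,京都,キョウト,キョート',
    '走る,772,772,7000,動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル',
);
my $avg = average_cost_by_length(\@rows, '名詞', '固有名詞', '一般');
print "$_ chars: $avg->{$_}\n" for sort { $a <=> $b } keys %$avg;
```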

How mecab-dic-overdrive generates costs

mecab-dic-overdrive decides costs by combining these two methods. By default it takes the smaller of:

  1. the average cost of existing words with the same part of speech and the same length (if no existing word meets the condition, a predetermined fixed value is used)
  2. the "threshold cost at which the word, appearing on its own, is just barely not split any further" shown above, multiplied by 0.7

(This behavior can be customized by editing GenDic.pm from around line 200.)

The former calculation refers to the dictionary's original CSV files and the latter to left-id.def, right-id.def and matrix.def, so the location of the mecab-ipadic source must be set in the config.
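
The default rule can be sketched as follows. This is an illustration of the description above, not the code in GenDic.pm: the name choose_cost, the fallback value of 5000 and the truncation with int are all assumptions.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Take the smaller of (1) the average cost for the same POS and length and
# (2) 0.7 x the threshold cost at which the word is just barely not split.
sub choose_cost {
    my ($average_cost, $boundary_cost, $fallback) = @_;
    $average_cost = $fallback unless defined $average_cost;   # no matching existing words
    return min($average_cost, int($boundary_cost * 0.7));
}

# The 鼻セレブ numbers from above: average 7989, threshold 15928.
print choose_cost(7989, 15928, 5000), "\n";   # 7989
```

For 鼻セレブ the average (7989) is smaller than 0.7 x 15928, so the average wins.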

Using mecab-dic-overdrive

Installing the dictionary & building user dictionaries

(1) Prepare the required libraries
  • Install MeCab itself and the following Perl libraries beforehand:
    • Text::MeCab
    • Unicode::Normalize
    • Unicode::RecursiveDowngrade
    • HTML::Entities
    • File::Spec
    • Path::Class
    • Log::Log4perl
  • Get mecab-dic-overdrive from GitHub.
  • Download mecab-ipadic-2.7.0-20070801 and unpack it. (With other versions, things may fail at the patch step described earlier, among other places.)
> git clone https://github.com/nabokov/mecab-dic-overdrive.git
> tar -xvzf mecab-ipadic-2.7.0-20070801.tar.gz
(2) Configure config.pl / log.conf

Customize mecab-dic-overdrive/etc/config.pl for your environment. At a minimum, edit:

  • $HOME (the directory where you unpacked mecab-dic-overdrive)
  • $DIC_SRC_DIR (the directory where you unpacked mecab-ipadic-2.7.0-20070801)

To change the normalization, edit default_normalize_opts, using the "Normalizing the dictionary" section above as a guide.

(Example)
default_normalize_opts => [qw(decode_entities strip_html nfkc lc)],

To change where the log is written or to change the log level, edit etc/log.conf.

(Example)
log4perl.rootLogger=DEBUG, LOGFILE
log4perl.appender.LOGFILE.filename=/path/to/log.txt
(3) Build a mecab-ipadic that is converted to UTF-8, normalized, and patched
>bin/initialize_dic.pl

This completes (1) converting the dictionary to UTF-8, (2) applying patches to the dictionary, (3) normalizing the dictionary, and (4) compiling & installing the dictionary.

If you get "make install failed", or if you want to keep the existing dictionary (/usr/local/lib/mecab/dic/ipadic) and install to a different location, copy the dictionary to another location by hand as shown below, and specify the dictionary directory with the -d option when invoking mecab.

(Example of installing manually to /usr/local/lib/mecab/dic/ipadic-utf8)

>bin/initialize_dic.pl --noinstall
>mkdir /usr/local/lib/mecab/dic/ipadic-utf8
>cp [ipadic source directory]/*.bin /usr/local/lib/mecab/dic/ipadic-utf8/
>cp [ipadic source directory]/*.def /usr/local/lib/mecab/dic/ipadic-utf8/
>cp [ipadic source directory]/*.dic /usr/local/lib/mecab/dic/ipadic-utf8/
>cp [ipadic source directory]/dicrc /usr/local/lib/mecab/dic/ipadic-utf8/

(After this, change dicdir = "/usr/local/lib/mecab/dic/ipadic" in etc/config.pl
 to "/usr/local/lib/mecab/dic/ipadic-utf8")

(When using mecab from the command line, specify the -d option)
>mecab -d /usr/local/lib/mecab/dic/ipadic-utf8/

(4) Build a user dictionary from Wikipedia data

Get jawiki-latest-page.sql.gz from the Japanese Wikipedia dump site and save it under misc/dic, still as .gz. (In environments where zcat/gzcat is not available, unpack it first. To change the file name or location, modify GenDic/WikipediaFile.pm as needed.)

>bin/generate_dic.pl --target=wikipedia_file

This reads the SQL file directly, extracts the article titles, and writes them out as "proper noun/general" to the user dictionary file misc/dic/wikipedia.dic.

Note: Because this parses the SQL statements directly by brute force, it may stop working if the Wikipedia dump format changes in the future. In that case, a more reliable method is also available: load the data into a DB first and export from the DB (specify --target=wikipedia instead of --target=wikipedia_file). See GenDic/Wikipedia.pm for details on the setup.

(5) Build a user dictionary from an emoticon dictionary (optional)

You can build a user dictionary by reading the TSV files distributed in various places for use as emoticon dictionaries.

The input is read from misc/dic/kaomoji.tsv, so if there are emoticons you want to add, append them there and then run

>bin/generate_dic.pl --target=kaomoji

This writes each emoticon out as "symbol/general" to misc/dic/kaomoji.dic.

Note: To give these entries higher priority than the symbol-like entries in wikipedia.dic, word costs are computed using a mecab that has loaded the previously built wikipedia.dic. For that reason, this step does not work without misc/dic/wikipedia.dic. To change this behavior, edit the defaults method in GenDic/Kaomoji.pm.

(6) Add other proper nouns that are not in Wikipedia (optional)

If there are other nouns you want to add, list them one per line in misc/dic/simple_list.txt and run

>bin/generate_dic.pl --target=simple_list

This reads all of them as "noun/proper noun/general" and writes them out to misc/dic/simple_list.dic.

Using the dictionaries you built

The user dictionaries built in steps (4)-(6) above can be used by specifying them with mecab's -u option.

>mecab -u misc/dic/wikipedia.dic,misc/dic/kaomoji.dic,misc/dic/simple_list.dic
(Before)
> mecab
崖の上のポニョニコ動で見た
崖	名詞,一般,*,*,*,*,崖,ガケ,ガケ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
上	名詞,非自立,副詞可能,*,*,*,上,ウエ,ウエ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
ポニョニコ	名詞,一般,*,*,*,*,*
動	名詞,一般,*,*,*,*,動,ドウ,ドー
で	助詞,格助詞,一般,*,*,*,で,デ,デ
見	動詞,自立,*,*,一段,連用形,見る,ミ,ミ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
EOS

(After)
>mecab -u misc/dic/wikipedia.dic
崖の上のポニョニコ動で見た
崖の上のポニョ	名詞,固有名詞,一般,*,*,*,崖の上のポニョ,Wikipedia:1070057
ニコ動	名詞,固有名詞,一般,*,*,*,ニコ動,Wikipedia:1347271
で	助詞,格助詞,一般,*,*,*,で,デ,デ
見	動詞,自立,*,*,一段,連用形,見る,ミ,ミ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
EOS

If you wrote the UTF-8 version of ipadic to a location other than the default, specify the -d option as well.

>mecab -d /usr/local/lib/mecab/dic/ipadic-utf8

Using the normalizer

As mentioned earlier, the dictionaries are pointless unless the text being analyzed goes through the same normalizer that was used when the dictionaries were built. When using the dictionaries you built, pass the input through an instance of MecabTrainer::NormalizeText as shown below.

use Encode;
use MecabTrainer::NormalizeText;

# Use the same option list that was used when building the dictionary.
my $normalizer = MecabTrainer::NormalizeText->new(
    [qw(decode_entities strip_single_nl nfkc lc)]
);

# Decode the raw bytes first, then normalize the decoded text.
my $normalized_decoded_text = $normalizer->normalize(
    Encode::decode('utf8', $raw_input_text)
);

For usage, see also the source of bin/normalize_text.pl.

On the command line, you can put bin/normalize_text.pl in a pipe in front of mecab.

Some usage examples and sample analysis results follow.

> bin/normalize_text.pl | mecab -d /usr/local/lib/mecab/dic/ipadic-utf8/ -u misc/dic/wikipedia.dic,misc/dic/kaomoji.dic
ひまなう(´・ω・｀)
^D
ひま	名詞,一般,*,*,*,*,ひま,ヒマ,ヒマ
なう	助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
( ́・ω・`)	名詞,固有名詞,一般,*,*,*,( ́・ω・`),Wikipedia:700982
EOS

久々更新〜 お腹へったょ
^D
久々	名詞,一般,*,*,*,*,久々,ヒサビサ,ヒサビサ
更新	名詞,サ変接続,*,*,*,*,更新,コウシン,コーシン
ー	記号,一般,*,*,*,*,─,─,─
お腹	名詞,一般,*,*,*,*,お腹,オナカ,オナカ
へっ	動詞,自立,*,*,五段・ラ行,連用タ接続,へる,ヘッ,ヘッ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
ょ	助詞,終助詞,*,*,*,*,よ,よ,よ

朝からテゴマスのあいCMやってた
^D
朝	名詞,副詞可能,*,*,*,*,朝,アサ,アサ
から	助詞,格助詞,一般,*,*,*,から,カラ,カラ
テゴマスのあい	名詞,固有名詞,一般,*,*,*,テゴマスのあい,Wikipedia:2035668
cm	名詞,一般,*,*,*,*,ＣＭ,シーエム,シーエム
やっ	動詞,自立,*,*,五段・ラ行,連用タ接続,やる,ヤッ,ヤッ
て	動詞,非自立,*,*,一段,連用形,てる,テ,テ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
EOS

( ﾟ∀ﾟ)ｱﾊﾊ八八ﾉヽﾉヽﾉヽﾉ ＼ / ＼/ ＼
^D
( ゚∀゚)アハハ八八ノヽノヽノヽノ  / / 	記号,一般,*,*,*,*,( ゚∀゚)アハハ八八ノヽノヽノヽノ  / / \n

Writing a custom reader class

By creating a subclass under GenDic/, you can build a user dictionary from arbitrary input. Inherit from the MecabTrainer::GenDic class and describe:

  • how to open the input stream
  • how to read the input one line at a time and parse it
  • which part of speech and features to assign to each word read

The parent class then takes care of computing word costs and compiling the dictionary. For details, see the sources under the GenDic directory.

You invoke your subclass by giving its name to the --target option of generate_dic.pl (with CamelCase converted to lowercase plus "_").

(Example of specifying the subclass TestClass.pm)
>bin/generate_dic.pl --target=test_class
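
The naming convention can be illustrated with a short sketch. These two helpers are hypothetical and only demonstrate the CamelCase to lowercase-plus-"_" mapping; the conversion code inside generate_dic.pl may well differ.

```perl
use strict;
use warnings;

# Hypothetical helper: class file name (without .pm) -> --target value
sub class_to_target {
    my ($class) = @_;
    (my $target = $class) =~ s/(?<=[a-z0-9])([A-Z])/_$1/g;   # insert "_" before inner capitals
    return lc $target;
}

# Hypothetical helper: --target value -> class name
sub target_to_class {
    my ($target) = @_;
    return join '', map { ucfirst($_) } split /_/, $target;
}

print class_to_target('TestClass'), "\n";        # test_class
print target_to_class('wikipedia_file'), "\n";   # WikipediaFile
```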

Building a user agent classifier with a decision tree


What does everyone use to identify browsers from the user agent (UA) strings in access logs?

The access analysis system I built uses HTTP::BrowserDetect and HTTP::MobileAgent, each with our own patches applied. These are rule-based classifiers, so every time a new browser or a new kind of bot shows up, someone has to add rules by hand, make a patch, and distribute it.

This update work is a real hassle and tends to lag behind. I found myself thinking, "Wouldn't it be handy if I could just feed it lots of examples saying 'this UA string is this browser' and have it learn the classification rules by itself?", and so I decided to try a decision tree.

The goal is to write a program that, given examples such as

  • "Mozilla/5.0 (Windows; U; Windows NT 6.1; ja; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15" is Firefox
  • "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like Mac OS X; ja-jp) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7" is Safari
  • ...

automatically acquires rules for identifying the browser from a UA string. For Perl there is a library called AI::DecisionTree, which makes it easy to play with decision trees.


Watching a decision tree at work

Let us use the simple script dt_test.pl below to watch how AI::DecisionTree learns rules. (This assumes AI::DecisionTree has already been installed via CPAN.)

• dt_test.pl

#!/usr/bin/perl

use strict;
use warnings;
use AI::DecisionTree;

my $dtree = AI::DecisionTree->new( prune => 0 );

# Read training data from stdin (UA string + tab + correct label)
while(<>) {
    chomp;
    my ($attributes, $result) = split(/\t+/);

    $dtree->add_instance(
        # Every whitespace-separated piece of the UA string becomes an attribute.
        attributes => { map { $_ => 1 } split(/\s+/, $attributes) },
        result => $result,
    );
}

$dtree->train; # learn

# Print the learned rules
print "\n--- rules\n";
print join "\n", $dtree->rule_statements;
print "\n---\n";

Sample run

> ./dt_test.pl
Mozilla/5.0 (Windows; U; Windows NT 6.1; ja; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15	Firefox/3.6
Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like Mac OS X; ja-jp) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7	Safari/4.0
^D
--- rules
if Mac='' -> 'Firefox/3.6'
if Mac='1' -> 'Safari/4.0'
---
>

The lines between the command and ^D are user input. After starting dt_test.pl, I entered two lines of training data representing Firefox and Safari (a UA string and the correct browser name joined by a tab). The program records this training data as

  • if strings such as "Mozilla/5.0", "(Windows;", "U;", "NT", "6.1;" ... are present, then "Firefox/3.6"
  • if strings such as "Mozilla/5.0", "(iPod;", "U;", "CPU" ... are present, then "Safari/4.0"

Each of these strings, such as "Mozilla/5.0", is called an attribute. Here the attributes are simply the given UA string split on whitespace.
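
To make the attribute set concrete, this small sketch applies the same two split calls that dt_test.pl uses to the first training line and prints what would be passed to add_instance. It does not need AI::DecisionTree.

```perl
use strict;
use warnings;

my $line = "Mozilla/5.0 (Windows; U; Windows NT 6.1; ja; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15\tFirefox/3.6";

# Same splitting as dt_test.pl: tab separates UA and label, whitespace separates attributes.
my ($ua, $result) = split /\t+/, $line;
my %attributes = map { $_ => 1 } split /\s+/, $ua;

print "result: $result\n";
print "attribute: $_\n" for sort keys %attributes;
```

Note that punctuation stays glued to the pieces, so "(Windows;" and "Windows" are two different attributes.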

From the examples given, the program looks for the simplest rule that tells "Firefox/3.6" and "Safari/4.0" apart. As a result, it focused on the attribute "Mac" and derived the rule

  • if the attribute "Mac" is absent it is Firefox, if present it is Safari

In reality, a UA containing "Mac" can still be Firefox. Let us add such an example to the training data and see how the program revises its rule.

> ./dt_test.pl
Mozilla/5.0 (Windows; U; Windows NT 6.1; ja; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15	Firefox/3.6
Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like Mac OS X; ja-jp) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7	Safari/4.0
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; ja-JP-mac; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11 GTB7.1	Firefox/3.6
^D
--- rules
if like='' -> 'Firefox/3.6'
if like='1' -> 'Safari/4.0'
---
>

This time the presence or absence of the attribute "Mac" could no longer distinguish Firefox from Safari, so the rule changed to

  • if "like" is absent it is Firefox, if present it is Safari

Well, now that you mention it, that does sort of make sense. Let us go further and add two more examples, Internet Explorer and Chrome, to confuse it.

>./dt_test.pl 
Mozilla/5.0 (Windows; U; Windows NT 6.1; ja; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15	Firefox/3.6
Mozilla/5.0 (iPod; U; CPU iPhone OS 4_1 like Mac OS X; ja-jp) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7	Safari/4.0
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; ja-JP-mac; rv:1.9.2.11) Gecko/20101012 Firefox/3.6.11 GTB7.1	Firefox/3.6
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.6; .NET CLR 1.1.4322)	Internet Explorer/8.0
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.102 Safari/534.13	Chrome/9.0
^D
--- rules
if like='' and MSIE='' -> 'Firefox/3.6'
if like='' and MSIE='1' -> 'Internet Explorer/8.0'
if like='1' and AppleWebKit/534.13='' -> 'Safari/4.0'
if like='1' and AppleWebKit/534.13='1' -> 'Chrome/9.0'
---
>

This time the decision became two-staged, as follows.

  • if "like" is absent
    • → if "MSIE" is absent it is Firefox, if present it is Internet Explorer
  • if "like" is present
    • → if "AppleWebKit/534.13" is absent it is Safari, if present it is Chrome

This is still far from usable as a UA classification rule, but note that it derived rules this compact from only five training examples.

In this way, a decision tree generalizes from the given examples and infers a tree-shaped set of classification rules.
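
Why did the rule switch from "Mac" to "like" once the third example was added? A common criterion in ID3-style decision tree learners is information gain, and the sketch below computes it for those two attributes on the three examples of the second run. This is an illustration of the general idea, not AI::DecisionTree's actual code; the function names and the reduced attribute sets are assumptions for this example.

```perl
use strict;
use warnings;

# Shannon entropy (in bits) of a list of class labels.
sub entropy {
    my @labels = @_;
    my %n;
    $n{$_}++ for @labels;
    my $h = 0;
    for my $c (values %n) {
        my $p = $c / @labels;
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

# Information gain from splitting on the presence/absence of one attribute.
sub info_gain {
    my ($attr, @instances) = @_;
    my (@with, @without);
    push @{ $_->{attrs}{$attr} ? \@with : \@without }, $_->{result} for @instances;
    my $remainder = 0;
    for my $part (\@with, \@without) {
        next unless @$part;
        $remainder += @$part / @instances * entropy(@$part);
    }
    return entropy(map { $_->{result} } @instances) - $remainder;
}

# The three training examples of the second run, reduced to two attributes.
my @instances = (
    { attrs => {},                      result => 'Firefox/3.6' },
    { attrs => { Mac => 1, like => 1 }, result => 'Safari/4.0'  },
    { attrs => { Mac => 1 },            result => 'Firefox/3.6' },
);
printf "%-4s gain = %.3f\n", $_, info_gain($_, @instances) for qw(Mac like);
```

Splitting on "like" separates the two classes completely, while splitting on "Mac" leaves one mixed group, so "like" has the higher gain.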

Building a somewhat more practical UA classifier and a training data set

So far we have looked at how a decision tree behaves. To make it practical, some more effort is needed:

  1. Splitting the UA string on spaces alone is too crude, so refine the splitting to improve accuracy.
  2. Make it possible to classify at three levels: type (PC browser / mobile browser / crawler etc.), OS name, and browser name.
  3. Feed it a large amount of training data so that every pattern is covered.
  4. (Since loading a large amount of training data takes time) make it possible to save the learned rules.

Refining how attributes are extracted from the UA string

Take, for example, the UA string

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.13 (KHTML, like Gecko)

Looking at it, a two-stage decomposition seems likely to work well:

  • Before splitting on whitespace, split on symbols such as "(", ")", "," and ";".
    • → Strings such as "Windows NT 6.1" and "Mozilla/5.0" become attributes as whole units.
  • Then try splitting those further on whitespace, "/" and so on, and if they split, add the pieces as attributes too.
    • → "Windows NT 6.1", "Windows", "NT", "6.1", "Mozilla/5.0", "Mozilla", "5.0" and so on all become attributes.

The reason for not splitting on whitespace and "/" from the start is that, for example, the "5.0" in "Windows NT 5.0" and in "Mozilla/5.0" is meaningless unless it stays attached to the preceding string. On the other hand, if only strings with the version number attached, such as "Mozilla/5.0", are used as attributes, the classifier will not notice the similarity between "Mozilla/4.0" and "Mozilla/5.0" and will struggle to generalize. So both the strings before the finer split and the strings after it are recorded as attributes, and the choice of which attributes to use is left to AI::DecisionTree.

  • An example of a function that returns all attribute candidates from a UA string as an array ref
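sub breakdown_ua {
    my ($ua) = @_;

    $ua = lc($ua);
    # Stage 1: split on , ; " ( ) and trim surrounding whitespace
    my @ua_str = map { s/^\s+//;s/\s+$//;$_ } grep $_, split (/[,;\"\(\)]/, $ua);
    # Stage 2: split each piece on whitespace, "/" and "-", dropping pieces that
    # consist only of digits and . _ - + (bare version numbers such as "5.0")
    my @sub_ua_str;
    for (@ua_str) {
        push @sub_ua_str, grep { $_ !~ /^[0-9\._\-\+]*$/ } split(/[\s\/\-]/);
    }

    # Return both the coarse and the fine pieces as attribute candidates
    return [@ua_str, @sub_ua_str];
}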
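
As a quick check of what the function produces, the sketch below runs a copy of it on the sample UA string from this section and prints the attribute candidates one per line.

```perl
use strict;
use warnings;

# Copy of breakdown_ua from the article, so that this snippet runs on its own.
sub breakdown_ua {
    my ($ua) = @_;

    $ua = lc($ua);
    my @ua_str = map { s/^\s+//;s/\s+$//;$_ } grep $_, split (/[,;\"\(\)]/, $ua);
    my @sub_ua_str;
    for (@ua_str) {
        push @sub_ua_str, grep { $_ !~ /^[0-9\._\-\+]*$/ } split(/[\s\/\-]/);
    }

    return [@ua_str, @sub_ua_str];
}

my $fragments = breakdown_ua(
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.13 (KHTML, like Gecko)'
);
print "$_\n" for @$fragments;
```

With this implementation everything is lowercased, and fragments made up only of digits and punctuation are dropped in the second stage, so a version number survives only as part of a coarse piece such as "windows nt 6.1" or "mozilla/5.0".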

Note: For how UA strings are put together, [this page] and [this page] were helpful.

Classifying at three levels: UA type, OS name, and browser name

A typical use of a UA classifier is to make a three-level type decision from a single UA string, like this:

  • "Mozilla/5.0 (Windows; U; Windows NT 6.1; ja; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    • → PC browser / Windows 7 / Firefox/3.6
  • "DoCoMo/2.0 N01B(c500;TB;W24H16) Mobile Browser DoCoMo N01B"
    • → mobile browser / DoCoMo / N01B
  • "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) "
    • → crawler / - / Googlebot

However, a decision tree is basically used to "derive a single result from a given set of attributes". To make a three-level decision at once, as in the example above, there are two approaches:

  1. Treat type/OS name/browser name as one combined result, as in: "Mozilla/5.0 (Windows; U; Windows NT 6.1; ja; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15" is the browser named "PC - Windows 7 - Firefox/3.6".
  2. Prepare three classifiers and train each independently: the same UA string is a "PC browser" for the first, "Windows 7" for the second, and "Firefox/3.6" for the third.

With the first approach the number of combinations grows, which seemed likely to make the rules needlessly complex, so this time I went with the second.

Preparing enough training data

As we saw in the first examples, practical rules cannot be learned if the data is insufficient or skewed.

This time I prepared three kinds of data:

  1. Data generated in-house from information on the official sites of the carriers DoCoMo, SoftBank, and au
  2. Data from useragentstring.com, massaged in various ways of our own
  3. UAs sampled from actual access logs of livedoor servers and (after removing the device ID field from some UAs and deduplicating) run through the rule-based UA classifier we already use

If you would like to try it, the data can be downloaded freely from [here].

  • The accuracy of the data is not guaranteed.
  • In every file, the UA string is followed by three tab-separated fields. The fields are as follows.
    • For PC & smartphone browsers: "Browser / (OS name) / (browser name)"
    • For mobile browsers: "Mobile Browser / (carrier name) / (model name)"
    • For crawlers: "Crawler / (empty field) / (crawler name)"
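
A line in this format can be taken apart with a single split. The sketch below assumes the layout described above; the helper name parse_training_line and the sample line itself are made up for illustration (the UA and labels are modelled on the Googlebot example earlier in this article, not copied from the data files).

```perl
use strict;
use warnings;

# Hypothetical helper: split one training line into the UA string and its three labels.
sub parse_training_line {
    my ($line) = @_;
    chomp $line;
    # limit -1 keeps empty fields, e.g. the empty second label of crawlers
    my ($ua, $type, $os_or_carrier, $name) = split /\t/, $line, -1;
    return { ua => $ua, type => $type, os_or_carrier => $os_or_carrier, name => $name };
}

my $rec = parse_training_line(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\tCrawler\t\tGooglebot\n"
);
printf "type=%s os/carrier=[%s] name=%s\n", @{$rec}{qw(type os_or_carrier name)};
```

An empty field simply means there is no label for that level, so a trainer can skip adding the instance to the corresponding tree.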

Making the trained classifier saveable

Loading a large amount of data like the above and generating rules takes a fair amount of time. If the program loaded the data and generated the rules every time it started, it would take a long while before it could do any actual classification.

So the program is split in two: the training program saves the AI::DecisionTree instances to a file with Storable, and the classification program restores the instances from that file and uses them.
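
The save-and-restore round trip can be tried in isolation with the core Storable module. In this minimal sketch, plain hashes stand in for trained AI::DecisionTree objects and a temporary file stands in for the real data file; both are assumptions made so that the snippet runs without any training data.

```perl
use strict;
use warnings;
use Storable qw(store_fd fd_retrieve);
use File::Temp qw(tempfile);

# Plain hashes stand in for the three trained trees.
my @objects = ( { name => 'type' }, { name => 'os' }, { name => 'browser' } );

my ($out, $path) = tempfile();
binmode $out;
store_fd($_, $out) for @objects;          # write the objects back to back
close $out or die "close failed: $!";

open my $in, '<', $path or die "cannot open $path: $!";
binmode $in;
my @restored = map { fd_retrieve($in) } 1 .. @objects;   # read them in the same order
close $in;
unlink $path;

print join(', ', map { $_->{name} } @restored), "\n";   # type, os, browser
```

Because several objects are written to one file handle in sequence, the reader has to know how many to expect and retrieve them in the same order.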

Here is the finished result.

  • build_tree.pl (builds decision trees from the given examples)
#!/usr/bin/perl

use strict;
use AI::DecisionTree;
use Storable qw(store_fd);

$| = 1;

my $N_TREES = 3;
my $FREEZER = 'frozen_dt.dat';

my @trees;
for my $i (1..$N_TREES) {
    push @trees, new AI::DecisionTree(
        prune => 1,
        noise_mode => 'pick_best',
    );
};

while (<>) {
    chomp;
    my ($ua, @results) = split(/\t/);
    next unless $ua;  # skip lines without a UA (return is not valid outside a sub)

    my $ua_fragments = breakdown_ua($ua);

    for my $i (0..$N_TREES-1) {
        next unless $results[$i];
        $trees[$i]->add_instance(
            attributes => {
                map {$_ => 1} @$ua_fragments,
            },
            result => $results[$i],
        );
    }
    print ".";
}

print "\ndone.\n";

open(FH, '>', $FREEZER) or die "cannot write $FREEZER: $!";
for my $i (0..$N_TREES-1) {
    print "\ntraining tree $i\n";
    $trees[$i]->train;

    my @s = $trees[$i]->rule_statements;
    print "\n==============\n";print join "\n",@s;print "\n";

    store_fd $trees[$i], \*FH;
}
close FH;

sub breakdown_ua {
    my ($ua) = @_;

    $ua = lc($ua);
    my @ua_str = map { s/^\s+//;s/\s+$//;$_ } grep $_, split (/[,;\"\(\)]/, $ua);
    my @sub_ua_str;
    for (@ua_str) {
        push @sub_ua_str, grep { $_ !~ /^[0-9\._\-\+]*$/ } split(/[\s\/\-]/);
    }

    return [@ua_str, @sub_ua_str];
}
  • Usage
> cat *.tsv | build_tree.pl
  • decide_ua.pl (classifies a UA using the previously saved decision trees)
#!/usr/bin/perl

use strict;
use AI::DecisionTree;
use Storable qw(fd_retrieve);

my $N_TREES = 3;
my $FREEZER = 'frozen_dt.dat';

my @trees;
open(FH, '<', $FREEZER) or die "cannot read $FREEZER: $!";
for my $i (1..$N_TREES) {
    push @trees, fd_retrieve(\*FH);
}
close FH;

my $ua = $ARGV[0];
if ($ua) {
    my $ua_fragments = breakdown_ua($ua);
    for my $i (0..$#trees) {
        my $result = $trees[$i]->get_result(
            attributes => {
                map {$_ => 1} @$ua_fragments,
            }
        );
        print "$result\n";
    }
} else {
    for (@trees) {
        my @s = $_->rule_statements;
        print "\n==============\n";print join "\n",@s;print "\n";
    }
}

sub breakdown_ua {
    my ($ua) = @_;

    $ua = lc($ua);
    my @ua_str = map { s/^\s+//;s/\s+$//;$_ } grep $_, split (/[,;\"\(\)]/, $ua);
    my @sub_ua_str;
    for (@ua_str) {
        push @sub_ua_str, grep { $_ !~ /^[0-9\._\-\+]*$/ } split(/[\s\/\-]/);
    }

    return [@ua_str, @sub_ua_str];
}

Note: Some code and variables, such as the breakdown_ua function, are needed by both the builder and the classifier, so factor them out into shared code as appropriate for real use.

  • Usage
> decide_ua.pl "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C)"
Browser
Windows 7
Internet Explorer/8.0
>

When given an argument, decide_ua.pl prints the classification result for that UA string; when given no argument, it prints the saved classification rules.

The punchline

So, do we actually use the resulting UA classifier in production? In the end, no.

Not having to write rules by hand and only having to list examples certainly sounds easier to maintain. But for that to pay off, you would also have to build a management system where "examples can be added easily from the web, and retraining → testing → deployment to production happens at the push of a button", and that part was a huge hassle, so...

Also, with a rule-based approach, describing how to extract the version number or model name, as below, lets you keep handling newcomers that are not in the training data, such as a new "MSIE 99.9".

if ($ua =~ /MSIE\s+([0-9\.]+)/i) {
  $browser = 'Internet Explorer';
  $version = $1;
}

With the decision tree approach shown here, however, separate training data is needed for every version and every model. (It feels like some trick, such as replacing all the variable parts with "#", might work, but still...)
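
The "#" idea mentioned in passing could look something like the sketch below. It was not part of the experiment described in this article; the function name mask_versions and the choice of what counts as a variable part (runs of digits joined by "." or "_") are assumptions.

```perl
use strict;
use warnings;

# Hypothetical preprocessing step: replace version-like digit runs with "#".
sub mask_versions {
    my ($fragment) = @_;
    (my $masked = $fragment) =~ s/\d+(?:[._]\d+)*/#/g;
    return $masked;
}

print mask_versions('msie 8.0'), "\n";         # msie #
print mask_versions('msie 99.9'), "\n";        # msie #
print mask_versions('firefox/3.6.15'), "\n";   # firefox/#
```

With such masking, "msie 8.0" and an unseen "msie 99.9" would map to the same attribute, at the price of the tree no longer being able to tell versions apart by itself.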

So, thanks to the combination of "the rule-based approach is getting by for now" and "all sorts of things are a hassle", this has so far ended as just an experiment. Still, decision trees can be used for many purposes besides UA classification, so keep them in the back of your mind and I am sure they will come in handy some day.

(Number of times "hassle" appears in this article: 4)