A decent Simplified/Traditional Chinese converter
I built a decent Simplified/Traditional Chinese converter (hereafter "S/T conversion").*1
Why do I call it "decent"?
Because S/T conversion is a one-to-many mapping, and a converter that does not get this right (or does not even try to) is not a decent one.
S/T conversion that is not decent
Take a few words that also exist in Japanese: "乾燥" (dryness), "幹部" (cadre), and "干涉" (interference).
In simplified Chinese, both "乾" and "幹" become "干", so these words are written "干燥", "干部", and "干涉".
When they are converted to traditional Chinese, they should come back as "乾燥", "幹部", and "干涉".
However, the sites that rank highest when you search for "簡体字 繁体字 変換" mostly fail at this.
One site produces "幹燥", "幹部", and "幹涉". In other words, it does nothing more than a blind "干→幹" substitution.
Another site produces "乾/幹/榦燥", "乾/幹/榦部", and "乾/幹/榦涉". It recognizes that one simplified character can correspond to several traditional characters, but it has no technique for choosing the right one.*2
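The two failure modes above can be reproduced in a few lines. This is only a sketch: the tables `ONE_TO_ONE` and `CANDIDATES` and both function names are hypothetical, and nothing here is taken from the sites in question.

```python
# Hypothetical one-to-one table: every 干 is blindly turned into 幹.
ONE_TO_ONE = {"干": "幹"}

# Hypothetical one-to-many table, written to mirror the second site's output
# (which lists 乾, 幹 and 榦 for 干).
CANDIDATES = {"干": ["乾", "幹", "榦"]}


def replace_one_to_one(text: str) -> str:
    """Replace each character independently, ignoring its context."""
    return "".join(ONE_TO_ONE.get(ch, ch) for ch in text)


def list_all_candidates(text: str) -> str:
    """List every candidate for an ambiguous character without choosing one."""
    return "".join("/".join(CANDIDATES.get(ch, [ch])) for ch in text)


for word in ["干燥", "干部", "干涉"]:
    print(word, replace_one_to_one(word), list_all_candidates(word))
```

Neither function looks at neighbouring characters, which is why neither can tell "乾燥" from "幹部".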
Some sites do convert these correctly to "乾燥", "幹部", and "干涉". Even on those sites, however, if you put the Chinese word "能" (can) in front, "能干燥" turns into "能幹燥". This happens because Chinese has the word "能幹", and that entry matches first.
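One way this kind of error can arise is greedy left-to-right phrase matching. The sketch below assumes that mechanism; the `PHRASES` table and `convert_longest_match` are hypothetical and are not the code of any particular site.

```python
# Hypothetical phrase table: simplified phrase -> traditional phrase.
PHRASES = {
    "干燥": "乾燥",
    "干部": "幹部",
    "干涉": "干涉",
    "能干": "能幹",
}


def convert_longest_match(text: str) -> str:
    """Scan left to right, always taking the longest phrase that matches."""
    max_len = max(len(key) for key in PHRASES)
    out = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += length
                break
        else:
            # No phrase starts at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(convert_longest_match("干燥"))    # the phrase entry applies directly
print(convert_longest_match("能干燥"))  # "能干" is consumed first, so "干燥" never matches
```

Once "能干" has been consumed, the remaining "燥" has no entry of its own, so the correct "干燥" entry is never consulted.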
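To deal with misconversions like this, Chinese Wikipedia maintains a list of more than 20,000 lines.
"面包" becomes "麵包", whereas "面包括" stays "面包括", because there it is "面" followed by "包括" (to include)...
There is no end to rules like that.
If possible, I would rather not touch such a thing by hand.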
Decent S/T conversion
That is why I developed the S/T converter introduced at the beginning.
On that page, "能干燥", for example, is correctly converted to "能乾燥".
Technology
This S/T converter is based on N-grams.
N-grams themselves are a standard technique, so I will not explain them here.
The conversion step uses the variable-order N-gram decoding that I wrote an article about a long time ago.
The source is published at https://github.com/hiroshi-manabe/jfconv-scripts.
The decoding algorithm is fairly simple because it relies on the State objects of KenLM, an N-gram library.
At conversion time, an input such as "干面" is first turned into the form "[乾|干|幹|榦] [面|麵]". The N-gram decoder then takes this and picks the most plausible sequence.
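The idea of expanding the input into candidates and picking the most plausible sequence can be sketched with a toy decoder. To be clear about the assumptions: this is a plain bigram Viterbi search, not the variable-order decoder or the KenLM State mechanism described above, and the `CANDIDATES` table and every score in `BIGRAM_LOGPROB` are made up for illustration.

```python
# Hypothetical candidate table: simplified -> possible traditional characters.
CANDIDATES = {"干": ["乾", "干", "幹", "榦"], "面": ["面", "麵"]}

# Made-up bigram log-probabilities, for illustration only. A real system would
# query a trained N-gram model instead. "<s>" marks the start of the sentence.
BIGRAM_LOGPROB = {
    ("<s>", "乾"): -2.0, ("<s>", "干"): -3.0, ("<s>", "幹"): -2.5,
    ("乾", "麵"): -1.0, ("乾", "面"): -5.0,
    ("<s>", "能"): -1.0, ("能", "幹"): -1.5, ("能", "乾"): -3.0,
    ("乾", "燥"): -0.5,
}
UNSEEN = -10.0  # flat penalty for bigrams missing from the toy table


def build_lattice(text):
    """Turn e.g. "干面" into [["乾", "干", "幹", "榦"], ["面", "麵"]]."""
    return [CANDIDATES.get(ch, [ch]) for ch in text]


def score(prev, cur):
    return BIGRAM_LOGPROB.get((prev, cur), UNSEEN)


def decode(text):
    """Return the candidate sequence with the highest total bigram score."""
    # best[c] = (total log-probability, path) of the best path ending in c
    best = {"<s>": (0.0, "")}
    for column in build_lattice(text):
        new_best = {}
        for cur in column:
            new_best[cur] = max(
                (total + score(prev, cur), path + cur)
                for prev, (total, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]


print(build_lattice("干面"))
print(decode("干面"))
print(decode("能干燥"))
```

With these toy scores, "能幹" looks better locally, but the unseen "幹燥" bigram is penalized, so the search over whole sequences settles on "能乾燥". That is the difference from greedy matching.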
Data
For simplified Chinese I used the encyclopedia Q&A data (百科問答, from a Q&A site) listed at https://github.com/brightmart/nlp_chinese_corpus. For traditional Chinese I used text crawled from various Taiwanese novel sites.
Background
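Some people may wonder why there are so few decent S/T converters.
The reason is that S/T conversion is a task that is "not really needed."
People in the simplified-character world can mostly read traditional characters, and the reverse is also true.
Of course, S/T conversion exists because the script you are used to is easier to read. But the human brain adapts well, so readers can fill in the gaps even when the conversion is somewhat wrong.
In that sense, unlike the major natural language processing tasks (translation, speech recognition, speech synthesis, and so on), S/T conversion is a task with little incentive to do properly.
On the other hand, that also means big companies are not going to work on it seriously, so an individual who puts in the effort can build something comparatively good.
That said, it is still a task that is "not really needed," so this is something of an exercise in self-satisfaction.
This N-gram decoder is handy for solving small NLP tasks, so I plan to build a few more tools with it.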