Introduction
ãã¼ãããããã³ããªããã¼ãï¼ð1 nikkieã§ãã
To deepen my understanding of how the named entity recognition (NER) task is solved with a CRF (Conditional Random Fields[2]), I practiced by working through a tutorial.
Table of contents
- Introduction
- Table of contents
- The tutorial by Hironsan
- What I built
- Overview of the tutorial
- Closing thoughts
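The tutorial by Hironsan
For practice I chose a tutorial written by Hironsan and published on Qiita.
Hironsan (Hiroki Nakayama) is the author and translator of books on machine learning and Python[3], and I also know him as a very active contributor to doccano.
The tutorial was written some time ago by someone I trust, it has a large number of stocks on Qiita (more than 500), and its topic, solving NER with a CRF, was exactly what I was looking for, so I picked it as the first tutorial to try.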
What I built
The one thing I did differently was to split the code into modules instead of writing one giant script.
Environment[4]
Overview of the tutorial
Data
The tutorial uses labeled data created by Hironsan.
From README.md:
hironsan.txt is a corpus built by running morphological analysis with MeCab on the Japanese edition of Wikinews and tagging the result with IOB2 tags.
In total, 500 sentences are tagged.
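Feature extraction
CRFs are machine learning models that predate deep learning. As I understand it, the features have to be designed by hand, and their quality largely determines performance.
The feature extraction part of the tutorial builds these features.
From the tutorial's "Overview":
This time we use the words within two positions before and after the current word, the part-of-speech subcategories, the character types, and the named entity tags.
Let's look at concrete data.
This is the first sentence of hironsan.txt.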
2005 名詞 数 * * * * * B-DAT
年 名詞 接尾 助数詞 * * * 年 ネン ネン I-DAT
7 名詞 数 * * * * * I-DAT
(omitted)
た 助動詞 * * * 特殊・タ 基本形 た タ タ O
。 記号 句点 * * * * 。 。 。 O
It is loaded as follows.
>>> train_sents[0][0]
['2005', '名詞', '数', '*', '*', '*', '*', '*', 'B-DAT']
>>> train_sents[0][1]
['年', '名詞', '接尾', '助数詞', '*', '*', '*', '年', 'ネン', 'ネン', 'I-DAT']
>>> train_sents[0][2]
['7', '名詞', '数', '*', '*', '*', '*', '*', 'I-DAT']
>>> train_sents[0][-1]
['。', '記号', '句点', '*', '*', '*', '*', '。', '。', '。', 'O']
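To make the loaded structure concrete, here is a minimal sketch of a reader that produces nested lists like `train_sents`. It is not the tutorial's loader: the function name `read_iob2` is hypothetical, and it assumes that fields are separated by whitespace and that sentences are separated by blank lines. The sample rows come from the listing above, arranged as two tiny "sentences" purely for illustration.

```python
def read_iob2(text):
    """Split an IOB2-style corpus into sentences, each a list of token rows."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Each row is: surface form, MeCab fields..., IOB2 tag (always last).
        current.append(line.split())
    if current:
        sentences.append(current)
    return sentences


# Rows taken from the post; the blank line between them is illustrative only.
sample = (
    "2005\t名詞\t数\t*\t*\t*\t*\t*\tB-DAT\n"
    "年\t名詞\t接尾\t助数詞\t*\t*\t*\t年\tネン\tネン\tI-DAT\n"
    "\n"
    "。\t記号\t句点\t*\t*\t*\t*\t。\t。\t。\tO\n"
)
sents = read_iob2(sample)
print(sents[0][0])
```

Note that rows can have different lengths (9 fields for `2005`, 11 for `年`), so the tag is read from the last position rather than from a fixed column.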
Here are the features that were built.
The first word, "2005", has no preceding words, so its features are built from the following words "年" and "7".
>>> pprint(X_train[0][0])
['bias',
 'word=2005',
 'type=ZDIGIT',
 'postag=名詞-数',
 'BOS',
 'BOS',
 '+1:word=年',
 '+1:type=OTHER',
 '+1:postag=名詞-接尾-助数詞',
 '+2:word=7',
 '+2:type=ZDIGIT',
 '+2:postag=名詞-数']
The third word, "7", gets its features from the preceding words "2005" and "年" and from the following words "月" and "14".
>>> pprint(X_train[0][2])
['bias',
 'word=7',
 'type=ZDIGIT',
 'postag=名詞-数',
 '-2:word=2005',
 '-2:type=ZDIGIT',
 '-2:postag=名詞-数',
 '-2:iobtag=B-DAT',
 '-1:word=年',
 '-1:type=OTHER',
 '-1:postag=名詞-接尾-助数詞',
 '-1:iobtag=I-DAT',
 '+1:word=月',
 '+1:type=OTHER',
 '+1:postag=名詞-一般',
 '+2:word=14',
 '+2:type=ZDIGIT',
 '+2:postag=名詞-数']
The last word of the sentence, "。", has no following words, so its features are built from the two preceding words "れ" and "た".
>>> pprint(X_train[0][-1])
['bias',
 'word=。',
 'type=OTHER',
 'postag=記号-句点',
 '-2:word=れ',
 '-2:type=HIRAG',
 '-2:postag=動詞-接尾',
 '-2:iobtag=O',
 '-1:word=た',
 '-1:type=HIRAG',
 '-1:postag=助動詞',
 '-1:iobtag=O',
 'EOS',
 'EOS']
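The implementation I worked on is in feature_engineering.py.
While working on it, I really did not want to write nearly identical code over and over, so I factored parts of it into functions (that said, I think copy-pasting is perfectly fine if the goal is to get things running in 30 minutes).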
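The feature lists shown earlier can be approximated with a short sketch. This is not the contents of feature_engineering.py or the tutorial's code: every function name here (`char_type`, `word_type`, `postag`, `context_features`, `word2features`) is hypothetical, only the three character types that appear in the output (ZDIGIT, HIRAG, OTHER) are modeled, and the part of speech is assumed to live in fields 1 to 3 of each row with `*` meaning "unset".

```python
def char_type(ch):
    """Coarse character class; only the classes seen in the post are modeled."""
    if ch.isdigit():
        return "ZDIGIT"
    if "\u3040" <= ch <= "\u309f":  # Hiragana block
        return "HIRAG"
    return "OTHER"


def word_type(word):
    # Join the distinct character classes found in the word.
    return "-".join(sorted({char_type(ch) for ch in word}))


def postag(token):
    # Assumption: fields 1..3 are the part of speech and its subcategories.
    return "-".join(field for field in token[1:4] if field != "*")


def context_features(sent, i, offset):
    """Features of the word at i+offset, or BOS/EOS when out of range."""
    j = i + offset
    if j < 0:
        return ["BOS"]
    if j >= len(sent):
        return ["EOS"]
    token, prefix = sent[j], f"{offset:+d}"
    feats = [
        f"{prefix}:word={token[0]}",
        f"{prefix}:type={word_type(token[0])}",
        f"{prefix}:postag={postag(token)}",
    ]
    if offset < 0:
        # Only preceding words contribute their IOB2 tag (last field).
        feats.append(f"{prefix}:iobtag={token[-1]}")
    return feats


def word2features(sent, i):
    token = sent[i]
    feats = [
        "bias",
        f"word={token[0]}",
        f"type={word_type(token[0])}",
        f"postag={postag(token)}",
    ]
    for offset in (-2, -1, 1, 2):
        feats.extend(context_features(sent, i, offset))
    return feats


sent = [
    ["2005", "名詞", "数", "*", "*", "*", "*", "*", "B-DAT"],
    ["年", "名詞", "接尾", "助数詞", "*", "*", "*", "年", "ネン", "ネン", "I-DAT"],
    ["7", "名詞", "数", "*", "*", "*", "*", "*", "I-DAT"],
]
print(word2features(sent, 0))
```

Putting the window logic in one helper is the kind of small refactoring that avoids writing four nearly identical blocks for the offsets -2, -1, +1, and +2.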
Training CRFsuite
This time python-crfsuite is used.
The way Trainer and Tagger are handled is distinctive (at least to someone used to the scikit-learn interface).
- Trainer[5]
  - Add data with the append method
  - Specify hyperparameters with the set_params method
  - Train (fit) with train; the model is saved to the path you pass in
- Tagger[6]
  - After initializing it, load the model from a file with open
  - Run inference (predict) with the tag method
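The Trainer and Tagger flow described in the list can be sketched as follows. This is a minimal illustration, not the tutorial's script: the wrapper function `train_and_tag` is hypothetical, the hyperparameter values are placeholders rather than the ones the tutorial uses, and the model is written to a temporary directory only to keep the sketch self-contained.

```python
import os
import tempfile


def train_and_tag(X_train, y_train, X_test, params=None):
    """Train a CRF with python-crfsuite, then tag X_test with the saved model."""
    # Imported here so the sketch can be read without the package installed.
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    # append: one call per sentence (feature sequence, label sequence).
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    # set_params: hyperparameters (placeholder values, not from the tutorial).
    trainer.set_params(params or {"c1": 1.0, "c2": 1e-3, "max_iterations": 50})

    with tempfile.TemporaryDirectory() as tmp:
        model_path = os.path.join(tmp, "model.crfsuite")
        # train: fits the model and writes it to the given path.
        trainer.train(model_path)

        # Tagger: create it first, then open the saved model file.
        tagger = pycrfsuite.Tagger()
        tagger.open(model_path)
        predictions = [tagger.tag(xseq) for xseq in X_test]
        tagger.close()
    return predictions
```

Compared with scikit-learn, where `fit` returns an estimator that can immediately `predict`, the trained model here only exists as a file until a Tagger opens it.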
Model evaluation: results reproduced!
I was able to reproduce the tutorial's results!
              precision    recall  f1-score   support
       B-ART       1.00      0.89      0.94         9
       I-ART       0.92      1.00      0.96        12
       B-DAT       1.00      1.00      1.00        12
       I-DAT       1.00      1.00      1.00        22
       B-LOC       1.00      0.95      0.97        55
       I-LOC       0.94      0.94      0.94        17
       B-ORG       0.75      0.86      0.80        14
       I-ORG       1.00      0.90      0.95        10
       B-PSN       0.00      0.00      0.00         3
       B-TIM       1.00      0.71      0.83         7
       I-TIM       1.00      0.81      0.90        16
   micro avg       0.96      0.91      0.94       177
   macro avg       0.87      0.82      0.84       177
weighted avg       0.95      0.91      0.93       177
 samples avg       0.14      0.14      0.14       177
I suspect there are more avg rows at the bottom because my scikit-learn version is newer.
The "weighted avg" row matched the tutorial.
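As a sanity check on what the avg rows mean, the macro and weighted averages can be recomputed from the per-label rows of the report. The sketch below uses only the numbers printed in the report; the helper names `macro_avg` and `weighted_avg` are hypothetical. Because the inputs are already rounded to two decimals, the results are approximations that agree with the report after rounding. The micro and samples averages cannot be derived from these rows alone, so they are not recomputed.

```python
# (label, precision, recall, f1-score, support), copied from the report.
rows = [
    ("B-ART", 1.00, 0.89, 0.94, 9),
    ("I-ART", 0.92, 1.00, 0.96, 12),
    ("B-DAT", 1.00, 1.00, 1.00, 12),
    ("I-DAT", 1.00, 1.00, 1.00, 22),
    ("B-LOC", 1.00, 0.95, 0.97, 55),
    ("I-LOC", 0.94, 0.94, 0.94, 17),
    ("B-ORG", 0.75, 0.86, 0.80, 14),
    ("I-ORG", 1.00, 0.90, 0.95, 10),
    ("B-PSN", 0.00, 0.00, 0.00, 3),
    ("B-TIM", 1.00, 0.71, 0.83, 7),
    ("I-TIM", 1.00, 0.81, 0.90, 16),
]
SUPPORT = 4  # column index of the support count


def macro_avg(rows, col):
    """Unweighted mean over labels: every label counts equally."""
    return sum(row[col] for row in rows) / len(rows)


def weighted_avg(rows, col):
    """Mean over labels, weighted by how many true tokens each label has."""
    total = sum(row[SUPPORT] for row in rows)
    return sum(row[col] * row[SUPPORT] for row in rows) / total


for name, col in (("precision", 1), ("recall", 2), ("f1-score", 3)):
    print(f"{name:>9}  macro={macro_avg(rows, col):.2f}  "
          f"weighted={weighted_avg(rows, col):.2f}")
# precision  macro=0.87  weighted=0.95
#    recall  macro=0.82  weighted=0.91
#  f1-score  macro=0.84  weighted=0.93
```

The gap between the two averages comes mostly from B-PSN: its scores of 0.00 pull the macro average down, while its support of only 3 gives it little weight in the weighted average.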
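Closing thoughts
I practiced with a tutorial that performs named entity recognition with a CRF.
I'm satisfied: I reproduced the tutorial's results while working out an implementation that avoids duplicated code.
Now that I have code that runs end to end, I plan to try modifying it to deepen my understanding (here's hoping there will be a next installment).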
- 㤠↩
- Known in Japanese as 「条件付き確率場」. https://ja.wikipedia.org/wiki/%E6%9D%A1%E4%BB%B6%E4%BB%98%E3%81%8D%E7%A2%BA%E7%8E%87%E5%A0%B4↩
- His own book is 『機械学習・深層学習による自然言語処理入門』 (it includes a chapter on named entity recognition). His translations include 『機械学習エンジニアのためのTransformers』 and, most recently, 『ハイパフォーマンスPython 第2版』.↩
- Details: https://github.com/ftnext/ml-playground/blob/89f3f277c4cd998dbbd58af756d2a0fbaaf072a2/crf/hironsan-tutorial/requirements.lock↩
- https://python-crfsuite.readthedocs.io/en/latest/pycrfsuite.html#pycrfsuite.Trainer↩
- https://python-crfsuite.readthedocs.io/en/latest/pycrfsuite.html#pycrfsuite.Tagger↩