Using Stanford CoreNLP from Python: A Summary
This article summarizes how to install the natural language processing library "Stanford CoreNLP" so that it can be used from Python, and how to actually use it.
What is Stanford CoreNLP?
Stanford CoreNLP is a library for natural language processing (NLP) of text, primarily English.
It covers the standard NLP tasks, including part-of-speech tagging, named entity recognition, and syntactic parsing.
If you are doing NLP on English text, I think you can't go wrong by starting with it.
Unfortunately, it does not support Japanese.
See the official site for details (in English).
CoreNLP is implemented in Java, but wrappers are available for using it from various languages.
To use it from Python, this article uses a library called "stanford_corenlp_pywrapper".
The rest of this article covers everything from installing CoreNLP to using it from Python.
Installation
The following are the installation steps for CentOS 6.
You should be able to install everything simply by running the listed commands in a terminal, in order.
Installing Java 8 (CentOS 6)
CoreNLP does not run on anything older than Java 8, so install Java 8 first.
Skip this step if Java 8 is already installed.
The procedure basically follows the one on the referenced page.
First, uninstall the old versions.
# yum remove java-1.6.0-openjdk
# yum remove java-1.7.0-openjdk
Install Java 8.
# wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u51-b16/jdk-8u51-linux-x64.rpm
# rpm -ivh jdk-8u51-linux-x64.rpm
# rm -rf jdk-8u51-linux-x64.rpm\?AuthParam=1437734086_72d1fd06e1b9dd5b73e0c2affcd52895
Run java -version. If the version is displayed as shown below, the Java 8 installation is complete.
$ java -version
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)
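If you want to run the same check from a script, the version line can be parsed as in the minimal sketch below. The helper names java_major_minor and is_java8_or_newer are hypothetical, and the sample text is simply the output shown above. Note that java -version writes to stderr, so a script that runs the command would need to capture stderr rather than stdout.

```python
import re


def java_major_minor(version_output):
    """Extract (major, minor) from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def is_java8_or_newer(version_output):
    """Return True if the reported version is Java 8 or later."""
    major, minor = java_major_minor(version_output)
    if major == 1:
        # Java 8 and earlier report themselves as "1.x" (e.g. "1.8.0_51").
        return minor >= 8
    # Later releases report the major version directly (e.g. "11.0.2").
    return major >= 9


# Sample text copied from the output shown above.
sample_output = (
    'java version "1.8.0_51"\n'
    'Java(TM) SE Runtime Environment (build 1.8.0_51-b16)\n'
    'Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)\n'
)

print(is_java8_or_newer(sample_output))
```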
Installing CoreNLP
"Installing" here just means downloading the set of jar files and related files.
We install version 3.5.2, the latest release at the time of writing.
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-04-20.zip
$ unzip stanford-corenlp-full-2015-04-20.zip
$ rm -rf stanford-corenlp-full-2015-04-20.zip
Installing the Python Wrapper
Install "stanford_corenlp_pywrapper", the module for using CoreNLP from Python.
Clone it from GitHub and install it with pip.
# git clone https://github.com/brendano/stanford_corenlp_pywrapper
# cd stanford_corenlp_pywrapper
# pip install .
Start Python and import stanford_corenlp_pywrapper. If no error is raised, the installation is complete.
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
Usage
Let's try it out.
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP("pos", corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> proc.parse_doc("hello world. how are you?")
Note: replace /foo/bar/stanford-corenlp-full-2015-04-20/* with the directory that was created when you unzipped the CoreNLP archive earlier.
If it runs correctly, the analysis result should be displayed as follows.
{u'sentences': [
    {u'tokens': [u'hello', u'world', u'.'],
     u'lemmas': [u'hello', u'world', u'.'],
     u'pos': [u'UH', u'NN', u'.'],
     u'char_offsets': [[0, 5], [6, 11], [11, 12]]},
    {u'tokens': [u'how', u'are', u'you', u'?'],
     u'lemmas': [u'how', u'be', u'you', u'?'],
     u'pos': [u'WRB', u'VBP', u'PRP', u'.'],
     u'char_offsets': [[13, 16], [17, 20], [21, 24], [24, 25]]}
]}
As shown above, the basic usage is very simple: create a CoreNLP instance and parse (analyze) the text with the CoreNLP.parse_doc method.
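The char_offsets field in the result is worth a quick look. Judging from the sample output, each pair looks like a half-open [start, end) character range into the original input string. The sketch below checks that reading against the sample data only; it does not call CoreNLP, and the helper name tokens_from_offsets is hypothetical.

```python
# Input text and the relevant fields, copied from the sample output above.
text = "hello world. how are you?"
result = {
    'sentences': [
        {'tokens': ['hello', 'world', '.'],
         'char_offsets': [[0, 5], [6, 11], [11, 12]]},
        {'tokens': ['how', 'are', 'you', '?'],
         'char_offsets': [[13, 16], [17, 20], [21, 24], [24, 25]]},
    ]
}


def tokens_from_offsets(text, sentence):
    """Recover each token's surface string by slicing the original text."""
    return [text[start:end] for start, end in sentence['char_offsets']]


for sentence in result['sentences']:
    # For this sample, slicing by offset reproduces the token list exactly.
    print(tokens_from_offsets(text, sentence) == sentence['tokens'])
```

This is handy when you need to map an analysis result back to a position in the source text, for example to highlight a token.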
The result is returned as a dictionary, so you can pull out whatever you need with an expression such as result['sentences'][0]['tokens'][0].
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP("pos", corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> result = proc.parse_doc("hello world. how are you?")
>>> result['sentences'][0]['tokens'][0]
u'hello'
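Because the per-sentence lists line up by index, pairing each token with its tag is a one-liner. The following is a minimal sketch that works on a literal copy of the sample output, so it runs without CoreNLP installed. The helper name tagged_tokens is hypothetical.

```python
# Relevant fields copied from the sample output of parse_doc shown earlier.
result = {
    'sentences': [
        {'tokens': ['hello', 'world', '.'],
         'pos': ['UH', 'NN', '.']},
        {'tokens': ['how', 'are', 'you', '?'],
         'pos': ['WRB', 'VBP', 'PRP', '.']},
    ]
}


def tagged_tokens(result):
    """Return one list of (token, POS tag) pairs per sentence."""
    return [list(zip(sentence['tokens'], sentence['pos']))
            for sentence in result['sentences']]


for pairs in tagged_tokens(result):
    print(pairs)
```

The same zip pattern should apply to the other per-token lists in the result, such as lemmas.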
You can also pass the configdict option when creating the CoreNLP instance to specify which annotations you want.
Skipping unneeded annotators shortens the analysis time.
However, you need to watch out for dependencies between annotators (for example, part-of-speech tagging requires tokenization).
The examples below are provided for reference.
Tokenization
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP(configdict={'annotators': 'tokenize,ssplit'}, corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> proc.parse_doc("hello world. how are you?")
Part-of-speech tagging
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP(configdict={'annotators': 'tokenize,ssplit,pos'}, corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> proc.parse_doc("hello world. how are you?")
Note: for the meaning of the tags, see Penn Treebank P.O.S. Tags.
Lemmatization
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP(configdict={'annotators': 'tokenize,ssplit,pos,lemma'}, corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> proc.parse_doc("hello world. how are you?")
Named entity recognition
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP(configdict={'annotators': 'tokenize,ssplit,pos,lemma,ner'}, corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> proc.parse_doc("hello world. how are you?")
Syntactic parsing
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP(configdict={'annotators': 'tokenize,ssplit,parse'}, corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> proc.parse_doc("hello world. how are you?")
Combinations
The annotators above can be combined.
For example, the following runs both part-of-speech tagging and syntactic parsing.
$ python
>>> from stanford_corenlp_pywrapper import CoreNLP
>>> proc = CoreNLP(configdict={'annotators': 'tokenize,ssplit,pos,parse'}, corenlp_jars=["/foo/bar/stanford-corenlp-full-2015-04-20/*"])
>>> proc.parse_doc("hello world. how are you?")
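To avoid listing prerequisites by hand each time, you could build the annotators string from the tasks you actually want. The sketch below is only an illustration: the helper annotators_for is hypothetical, and the prerequisite table is an assumption taken from the annotator lists used in the examples above (CoreNLP 3.5.2), so check it against the official documentation if you use another version.

```python
# Assumed prerequisites, mirroring the annotator lists in the examples above.
REQUIRES = {
    'tokenize': [],
    'ssplit': ['tokenize'],
    'pos': ['tokenize', 'ssplit'],
    'lemma': ['tokenize', 'ssplit', 'pos'],
    'ner': ['tokenize', 'ssplit', 'pos', 'lemma'],
    'parse': ['tokenize', 'ssplit'],
}

# Order in which the annotators appear in the examples above.
ORDER = ['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse']


def annotators_for(*wanted):
    """Build an 'annotators' value that includes the assumed prerequisites."""
    needed = set()
    for name in wanted:
        if name not in REQUIRES:
            raise ValueError("unknown annotator: %s" % name)
        needed.add(name)
        needed.update(REQUIRES[name])
    return ','.join(name for name in ORDER if name in needed)


print(annotators_for('pos', 'parse'))
# The string could then be passed as in the examples above, e.g.
# CoreNLP(configdict={'annotators': annotators_for('pos', 'parse')}, corenlp_jars=[...])
```

Requesting only the end results keeps the configuration short and makes it harder to forget a prerequisite such as tokenize or ssplit.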
Summary
So, as you can see, CoreNLP lets you analyze English text from many different angles.
CoreNLP has many other features as well.
For details, see the reference pages listed below.
Reference pages
The Stanford Natural Language Processing Group
The official CoreNLP page. "Using the Stanford CoreNLP API" lists every item that can be analyzed.
GitHub - brendano/stanford_corenlp_pywrapper
The official page for stanford_corenlp_pywrapper. Detailed usage is also documented there.
Penn Treebank P.O.S. Tags
A good reference when you don't know what a POS tag (such as NNP or VBD) means.
Stanford CoreNLP Online Demo
Lets you try CoreNLP in the browser.