Releasing an idf dictionary for tf-idf built from Japanese Wikipedia
nora(野良)-idf-dic
Motivation
- Process the entire content of Wikipedia with a memory-efficient design built on LevelDB (a key-value store, kvs).
- Make preprocessing easier for other algorithms such as XGBoost and ElasticNet.
- Use a JSON format so the dictionary can also be used from scripting languages other than Python.
Format
- The idf dictionary is a JSON dict (also called a hash map).
idf = { term1: weight1, term2:weight2, ... }
In this format, each word is stored as a pair with its idf weight.
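A minimal sketch of reading such a file from Python, assuming it is UTF-8 JSON with string keys and float values as described above. The helper names `load_idf` and `idf_of` are hypothetical and are not part of the project.

```python
import json


def load_idf(path):
    """Load the term -> idf weight mapping from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def idf_of(idf, term, default=0.0):
    """Look up a term's weight, falling back to `default` for unseen terms."""
    return idf.get(term, default)
```

For example, `idf = load_idf("words_idf.json")` followed by `idf_of(idf, "狩り")` would return the stored weight for that word, or `0.0` if the word is not in the dictionary. What an unseen word should be worth is a design choice; `0.0` simply drops it from the vector.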
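About the formula
- tf-idf is a heuristic, so it has no single canonical definition. This section describes the form that seems to be used most often, and that people around me use.
- If you need more advanced background, the English Wikipedia article is a good place to dig further into what it means.
- It rests on the hypothesis that words which rarely appear across the whole corpus are probably important. More often than not, this works well.
- tf is the frequency of term t in document d.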
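One widely used form of the weighting, given here as an assumption and not as the exact expression the script implements, is the following. The symbols $N$ (total number of documents) and $\mathrm{df}(t)$ (number of documents that contain $t$) are introduced for illustration.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

A term that appears in every document gets $\mathrm{idf}(t) = \log 1 = 0$, while a term that appears in few documents gets a large weight, which matches the hypothesis stated above.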
The steps below show how to build the idf dictionary from scratch, including installing MeCab and the other tools. If you only want to use the idf dictionary, you can skip them.
Getting the project and installing the surrounding software
Installing LevelDB (kvs)
(Assumes Ubuntu 16.04 or later)
$ git clone https://github.com/google/leveldb.git
$ cd leveldb
$ make
$ cd include
$ sudo cp -r leveldb/ /usr/local/include/
$ cd ..
$ cd out-shared
$ sudo cp lib* /usr/local/lib/
$ sudo ldconfig
$ cd ~
$ sudo apt install mecab libmecab-dev mecab-ipadic
$ sudo apt install mecab-ipadic-utf8
Installing mecab-python3 and plyvel
$ git clone https://github.com/GINK03/tiny-japanese-wikipedia-tfidf-dic-generator
$ sudo pip3 install mecab-python3
$ sudo pip3 install plyvel
Installing NEologd and switching the dictionary
$ cd ~
$ git clone https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd/
$ ./bin/install-mecab-ipadic-neologd
[install-mecab-ipadic-NEologd] : Do you want to install mecab-ipadic-NEologd? Type yes or no.
>yes
$ sudo vi /etc/mecabrc
(before) dicdir = /var/lib/mecab/dic/debian
-> (after) dicdir = /usr/lib/mecab/dic/mecab-ipadic-neologd
Testing NEologd
$ echo "Fate/Grand Order" | mecab
Fate/Grand Order 名詞,固有名詞,一般,*,*,*,Fate/Grand Order,フェイトグランドオーダー,フェイトグランドオーダー
EOS
Checking that it runs
$ cd ~
$ cd tiny-japanese-wikipedia-tfidf-dic-generator
$ python3 nora-idf-dic.py
(OK if nothing is printed)
Getting the Wikipedia dump
Download and extract the data known as a Wikipedia snapshot.
$ wget https://dumps.wikimedia.org/jawiki/20170201/jawiki-20170201-pages-articles-multistream.xml.bz2
$ bunzip2 jawiki-20170201-pages-articles-multistream.xml.bz2
Build the idf dictionary.
$ python3 nora-idf-dic.py --wakati
(... wait about 60 minutes)
$ ls title_context.ldb
(OK if this directory exists)
$ python3 nora-idf-dic.py --build
(... wait about 3 minutes)
$ ls words_idf.json
words_idf.json
Vectorizing with tf-idf
Here is a concrete example.
$ echo "あなた狩りごっこがあまり好きじゃないけものなんだね" | python3 nora-idf-dic.py --check
{'ね': 4.926646596986834, 'ない': 2.042401886218362, 'だ': 2.8119346405476735, 'が': 1.2142350698667934, 'じゃ': 6.054326132384362, 'あなた': 5.476151075317936, 'ごっこ': 8.627077870130083, 'ん': 3.364157726200682, '狩り': 7.11635016692977, '好き': 4.97306829447642, 'けもの': 9.584680272531994, 'あまり': 5.093448481495583, 'な': 1.6713533531785785}
If you assign a numeric index to each key, this becomes a format that libsvm, XGBoost, and LightGBM can take as input.
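A rough sketch of that conversion, assuming the usual libsvm text layout of `label index:value ...` with ascending, 1-based indices. The function names and the choice to index terms in sorted order are illustrative assumptions; any stable mapping works as long as training and prediction share it.

```python
def build_index(terms):
    """Assign a stable 1-based integer index to each term (sorted order)."""
    return {term: i for i, term in enumerate(sorted(terms), start=1)}


def to_libsvm_line(label, weights, index):
    """Format one sample as 'label idx:value ...' with ascending indices."""
    # Terms missing from the index are skipped instead of raising an error.
    pairs = sorted((index[t], w) for t, w in weights.items() if t in index)
    return " ".join([str(label)] + ["%d:%g" % (i, w) for i, w in pairs])
```

In practice the index would be built once from all keys of the idf dictionary and saved next to the model, so the same term always maps to the same column.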
You do not have to read the dictionary through this script; feel free to load just the JSON file and use it however you like.
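As one way of using the JSON file on its own, the sketch below weights a list of tokens by raw count times idf. It assumes the tokens were already produced by the same tokenizer setup (MeCab + NEologd) so that they match the dictionary keys, and the raw-count tf is an assumption: the weighting that `--check` applies may differ. `tfidf_vector` is a hypothetical name.

```python
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight each known token by (raw count in this text) * (idf weight)."""
    counts = Counter(tokens)
    # Tokens that are absent from the idf dictionary are dropped.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}
```

Here `idf` is the dict obtained by parsing `words_idf.json` with any JSON library, which is also why the same approach carries over to languages other than Python.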
Code
- Wikipedia is a huge corpus, so it cannot reasonably be processed entirely in memory. The approach is to work through the operations that do not fit in memory step by step, using a kvs.
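The idea can be sketched as follows. This is not the repository's code: it uses Python's standard `dbm.dumb` module as a stand-in for LevelDB so that it runs without extra installs, and the function names are hypothetical. Tokenized documents go to disk one at a time, and only the per-term document counts stay in memory.

```python
import dbm.dumb
import json
import math
from collections import Counter


def store_documents(db, docs):
    """Write each (title, tokens) pair to the store, one document at a time."""
    for title, tokens in docs:
        # Keep only the distinct terms; document frequency ignores repeats.
        value = json.dumps(sorted(set(tokens)))
        db[title.encode("utf-8")] = value.encode("utf-8")


def build_idf(db):
    """Stream the stored documents once and return term -> log(N / df)."""
    df = Counter()
    n_docs = 0
    for key in db.keys():
        df.update(json.loads(db[key].decode("utf-8")))
        n_docs += 1
    return {term: math.log(n_docs / count) for term, count in df.items()}
```

`docs` can be a generator that yields articles as they are parsed from the dump, so the full text never has to be held at once. With LevelDB through plyvel, the store would be opened with `plyvel.DB` and written with `put`, but the overall flow stays the same.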
- The code is on GitHub.
License and other notes
- Follows the Creative Commons Attribution-ShareAlike 3.0 Unported License.
- See Wikipedia for details.
Wikipedia:クリエイティブ・コモンズ 表示-継承 3.0 非移植 - Wikipedia
- The Wikipedia data is the snapshot as of 2017/02/01.
- MeCab + NEologd (as of 2016/12) was used as the morphological analysis engine.