Text-mining the dialogue of "魔法少女まどか☆マギカ" (Puella Magi Madoka Magica) with the Apache Mahout machine learning library
- Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman (Manning Publications, 2011-10-28, paperback)
Information & Links
This article is a record of clustering the dialogue of "魔法少女まどか☆マギカ" (Puella Magi Madoka Magica) with machine learning. The clustering and graph-image experiments and verification are still incomplete, so I will retry them later, but I am publishing what I got done so far. The links below are the references used for this article.
- 「魔法少女まどか☆マギカ」の台詞をJavaScriptでMapReduceしてGoogle Chart APIでグラフ出力したよ! - Yuta.Kikuchiの日記
- 試すのが難しい―機械学習の常識はMahoutで変わる (1/3) - @IT
- Apache Mahout の紹介
- Mahout JP
- テキストマイニングで始める実践Hadoop活用
- Overview (Hadoop 0.20.2 API)
- Open Source Java Code Online - JavaSourceCode
- Jar File Download examples (example source code) Organized by topic
Apache Mahout
Apache Mahout: Scalable machine learning and data mining
Mahout (pronounced "mahout") is a scalable machine learning library that can be used with Apache Hadoop. It provides machine learning features such as collaborative filtering, user recommendation, and k-means, so those algorithms can be used with little effort. Mahout ships with the following machine learning libraries:
- Collaborative Filtering
- User and Item based recommenders
- K-Means, Fuzzy K-Means clustering
- Mean Shift clustering
- Dirichlet process clustering
- Latent Dirichlet Allocation
- Singular value decomposition
- Parallel Frequent Pattern mining
- Complementary Naive Bayes classifier
- Random forest decision tree based classifier
- High performance java collections (previously colt collections)
- A vibrant community
- and many more cool stuff to come by this summer thanks to Google summer of code
Mahout Download / Setup
http://ftp.meisei-u.ac.jp/mirror/apache/dist/mahout/
Download and install Mahout from the mirror above. This walkthrough assumes Hadoop is already installed; see CentOSでHadoopを使ってみる - Yuta.Kikuchiの日記 for Hadoop setup. Simply unpacking the Mahout distribution should basically be enough, but install Maven if you need it. The environment used here is CentOS 5.7. Mahout needs the JAVA_HOME, HADOOP_HOME, and HADOOP_CONF_DIR environment variables, so add them to .zshrc.

```
$ cat /etc/redhat-release
CentOS release 5.7 (Final)
$ wget http://ftp.meisei-u.ac.jp/mirror/apache/dist/mahout/0.6/mahout-distribution-0.6.tar.gz
$ tar -xzf mahout-distribution-0.6.tar.gz
$ file mahout-distribution-0.6/bin/mahout   # the launcher command
mahout-distribution-0.6/bin/mahout: Bourne-Again shell script text executable
$ cd mahout-distribution-0.6
$ ls -al
-rw-r--r-- 1 yuta yuta    39588  2月  1 22:30 LICENSE.txt
-rw-r--r-- 1 yuta yuta     1888  2月  1 22:30 NOTICE.txt
-rw-r--r-- 1 yuta yuta     1200  2月  1 22:30 README.txt
drwxr-xr-x 2 yuta yuta     4096  4月  8 19:15 bin
drwxr-xr-x 3 yuta yuta     4096  4月  8 19:15 buildtools
drwxr-xr-x 2 yuta yuta     4096  2月  1 22:29 conf
drwxr-xr-x 3 yuta yuta     4096  4月  8 19:15 core
drwxr-xr-x 3 yuta yuta     4096  4月  8 19:15 distribution
drwxr-xr-x 6 yuta yuta     4096  4月  8 19:15 docs
drwxr-xr-x 5 yuta yuta     4096  4月  8 19:15 examples
drwxr-xr-x 3 yuta yuta     4096  4月  8 19:15 integration
drwxr-xr-x 2 yuta yuta     4096  4月  8 19:15 lib
-rw-r--r-- 1 yuta yuta 11190212  2月  1 22:31 mahout-core-0.6-job.jar
-rw-r--r-- 1 yuta yuta  1662876  2月  1 22:31 mahout-core-0.6.jar
-rw-r--r-- 1 yuta yuta 23593299  2月  1 22:33 mahout-examples-0.6-job.jar
-rw-r--r-- 1 yuta yuta   379461  2月  1 22:33 mahout-examples-0.6.jar
-rw-r--r-- 1 yuta yuta   284781  2月  1 22:32 mahout-integration-0.6.jar
-rw-r--r-- 1 yuta yuta   288914  2月  1 22:30 mahout-math-0.6.jar
drwxr-xr-x 3 yuta yuta     4096  4月  8 19:15 math
```

Add the following to .zshrc:

```
export JAVA_HOME=/usr/java/default/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
export PATH=$HADOOP_HOME/bin:$PATH
```
Madmagi Words
Scrape the dialogue of 魔法少女まどか☆マギカ, then store two tokenizations of the scraped text on Hadoop HDFS: a word-segmented version produced with NLTK and a morphological analysis (MA) produced with MeCab.
Scraping
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2

# Script pages for each episode on the Madoka Magica wiki
urls = (
    'http://www22.atwiki.jp/madoka-magica/pages/170.html',
    'http://www22.atwiki.jp/madoka-magica/pages/175.html',
    'http://www22.atwiki.jp/madoka-magica/pages/179.html',
    'http://www22.atwiki.jp/madoka-magica/pages/180.html',
    'http://www22.atwiki.jp/madoka-magica/pages/200.html',
    'http://www22.atwiki.jp/madoka-magica/pages/247.html',
    'http://www22.atwiki.jp/madoka-magica/pages/244.html',
    'http://www22.atwiki.jp/madoka-magica/pages/249.html',
    'http://www22.atwiki.jp/madoka-magica/pages/250.html',
    'http://www22.atwiki.jp/madoka-magica/pages/252.html',
    'http://www22.atwiki.jp/madoka-magica/pages/241.html',
    'http://www22.atwiki.jp/madoka-magica/pages/254.html',
)

contents_re = re.compile(r'<div class="contents".*?>((.|\n)*?)</div>', re.M)
line_re = re.compile(r'「(.*?)」', re.M)

f = open('./madmagi.txt', 'w')
opener = urllib2.build_opener()
ua = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.51.22 '
      '(KHTML, like Gecko) Version/5.1.1 Safari/534.51.22')
referer = 'http://www22.atwiki.jp/madoka-magica/'
opener.addheaders = [('User-Agent', ua), ('Referer', referer)]

for url in urls:
    content = opener.open(url).read()
    match = contents_re.search(content)
    if match is not None:
        # Extract every line of dialogue (the text inside 「」 brackets)
        for line in line_re.findall(match.group()):
            f.write(line + "\n")
f.close()
```

Word MA
Segment the text into space-delimited words (wakachigaki).
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import nltk
from nltk.corpus.reader import *
from nltk.corpus.reader.util import *
from nltk.text import Text

# Sentence tokenizer: split on Japanese sentence-ending punctuation
jp_sent_tokenizer = nltk.RegexpTokenizer(u'[^ 「」!?。]*[!?。]')
# Word tokenizer: split on character-type boundaries (hiragana / katakana / kanji / other)
jp_chartype_tokenizer = nltk.RegexpTokenizer(
    u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[^ぁ-んァ-ンー\u4e00-\u9FFF]+)')

data = PlaintextCorpusReader('./', r'madmagi.txt',
                             encoding='utf-8',
                             para_block_reader=read_line_block,
                             sent_tokenizer=jp_sent_tokenizer,
                             word_tokenizer=jp_chartype_tokenizer)

# Save the space-delimited words to a file
f = open('./word.txt', 'w')
for i in data.words():
    f.write(i + " ")
f.close()
```

Mecab MA
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import MeCab

mecab = MeCab.Tagger('-Ochasen')
data = open('./madmagi.txt').read()
f = open('./ma.txt', 'w')

# Walk MeCab's node list and write each surface form, space-delimited
node = mecab.parseToNode(data)
while node:
    if node.surface:  # skip the empty BOS/EOS nodes
        f.write(node.surface + " ")
    node = node.next
f.close()
```

HDFS PUT
Put the extracted Word MA / MeCab MA data onto HDFS.
```
$ alias hdfs='hadoop dfs'
$ hdfs -mkdir madmagi_in
$ hdfs -put data/ma.txt madmagi_in/
$ hdfs -put data/word.txt madmagi_in/
$ hdfs -lsr madmagi_in
-rw-r--r-- 1 yuta supergroup 104440 2012-03-26 01:16 /user/yuta/madmagi_in/ma.txt
-rw-r--r-- 1 yuta supergroup 101266 2012-03-26 01:16 /user/yuta/madmagi_in/word.txt
```
Clustering Theory
Before running Mahout, let's briefly go over the underlying theory.
TF/IDF
tf-idf - Wikipedia
A term-weighting technique used in information retrieval: TF is the term frequency within a document and IDF the inverse document frequency, and their product measures how important a term is.
K-Means
K平均法 - Wikipedia
K-means is a simple clustering algorithm that iterates the computation of cluster centers to arrive at progressively better clusters. The basic flow is:
1. Choose k initial cluster centers (for example, at random).
2. Assign each point to its nearest center.
3. Recompute each center as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing, or a maximum iteration count is reached.
Canopy Clustering
Let's also touch briefly on Canopy clustering. Canopy designates a maximum radius T1 and a minimum radius T2 around each candidate cluster center, and the points that fall within those radii are treated as belonging to the same cluster (canopy).
Word Vector
Now use Mahout to convert the MeCab MA output into vectors. (I won't describe the Word MA side, but clustering it should work the same way as the MeCab MA data.) The conversion uses Mahout's seqdirectory and seq2sparse commands. First, look at the help for seqdirectory and seq2sparse.
```
$ bin/mahout seqdirectory -h
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived on the compute machines.
 -conf <configuration file>    specify an application configuration file
 -D <property=value>           use value for given property
 -files <paths>                comma separated files to be copied to the map reduce cluster
 -fs <local|namenode:port>     specify a namenode
 -jt <local|jobtracker:port>   specify a job tracker
 -libjars <paths>              comma separated jar files to include in the classpath.
 -tokenCacheFile <tokensFile>  name of the file with the tokens
Job-Specific Options:
 --input (-i) input                           Path to job input directory.
 --output (-o) output                         The directory pathname for output.
 --overwrite (-ow)                            If present, overwrite the output directory before running job
 --chunkSize (-chunk) chunkSize               The chunkSize in MegaBytes. Defaults to 64
 --fileFilterClass (-filter) fileFilterClass  The name of the class to use for file parsing. Default: org.apache.mahout.text.PrefixAdditionFilter
 --keyPrefix (-prefix) keyPrefix              The prefix to be prepended to the key
 --charset (-c) charset                       The name of the character encoding of the input files. Default to UTF-8
 --help (-h)                                  Print out help
 --tempDir tempDir                            Intermediate output directory
 --startPhase startPhase                      First phase to run
 --endPhase endPhase                          Last phase to run

$ bin/mahout seq2sparse -h
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Usage: [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma <maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
 --minSupport (-s) minSupport      (Optional) Minimum Support. Default Value: 2
 --analyzerName (-a) analyzerName  The class name of the analyzer
 --chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes. 100-10000 MB
 --output (-o) output              The directory pathname for output.
 --input (-i) input                Path to job input directory.
 --minDF (-md) minDF               The minimum document frequency. Default is 1
 --maxDFSigma (-xs) maxDFSigma     What portion of the tf (tf-idf) vectors to be used, expressed in times the standard deviation (sigma) of the document frequencies of these vectors. Can be used to remove really high frequency terms. Expressed as a double value. Good value to be specified is 3.0. In case the value is less then 0 no vectors will be filtered out. Default is -1.0. Overrides maxDFPercent
 --maxDFPercent (-x) maxDFPercent  The max percentage of docs for the DF. Can be used to remove really high frequency terms. Expressed as an integer between 0 and 100. Default is 99. If maxDFSigma is also set, it will override this value.
 --weight (-wt) weight             The kind of weight to use. Currently TF or TFIDF
 --norm (-n) norm                  The norm to use, expressed as either a float or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize
 --minLLR (-ml) minLLR             (Optional) The minimum Log Likelihood Ratio (Float). Default is 1.0
 --numReducers (-nr) numReducers   (Optional) Number of reduce tasks. Default Value: 1
 --maxNGramSize (-ng) ngramSize    (Optional) The maximum size of ngrams to create (2 = bigrams, 3 = trigrams, etc). Default Value: 1
 --overwrite (-ow)                 If set, overwrite the output directory
 --help (-h)                       Print out help
 --sequentialAccessVector (-seq)   (Optional) Whether output vectors should be SequentialAccessVectors. If set true else false
 --namedVector (-nv)               (Optional) Whether output vectors should be NamedVectors. If set true else false
 --logNormalize (-lnorm)           (Optional) Whether output vectors should be logNormalize. If set true else false
```

seqdirectory converts the text file into a SequenceFile, and seq2sparse converts the SequenceFile into vectors. The important seq2sparse options are summarized in the table below.
| option | description |
| --- | --- |
| minSupport | minimum number of occurrences per record for a term to be kept |
| minDF | minimum number of documents a term must appear in |
| maxDFPercent | terms appearing in more than this percentage of documents are dropped |
| maxNGramSize | maximum number of words to consider as a candidate phrase (n-gram) |
| minLLR | minimum co-occurrence likelihood ratio for adopting a phrase |
| sequentialAccessVector | output in a sequential-access file format |
| namedVector | give each vector a name |

```
$ bin/mahout seqdirectory \
  --input madmagi_in/ma.txt \
  --output madmagi_out_ma/seq \
  -c UTF-8
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 08:46:20 INFO common.AbstractJob: Command line arguments: {--charset=UTF-8, --chunkSize=64, --endPhase=2147483647, --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter, --input=madmagi_in/ma.txt, --keyPrefix=, --output=madmagi_out_ma/seq, --startPhase=0, --tempDir=temp}
12/05/03 08:46:25 INFO driver.MahoutDriver: Program took 5835 ms (Minutes: 0.09725)

$ bin/mahout seq2sparse \
  --input madmagi_out_ma/seq \
  --output madmagi_out_ma/vector \
  --minSupport 10 \
  --minDF 20 \
  --maxDFPercent 40 \
  --maxNGramSize 3
```
Each run produces several kinds of files:

```
$ hdfs -ls madmagi_out_ma/vector
Found 7 items
drwxr-xr-x - yuta supergroup     0 2012-05-03 09:32 /user/yuta/madmagi_out_ma/vector/df-count
-rw-r--r-- 1 yuta supergroup 17602 2012-05-03 09:30 /user/yuta/madmagi_out_ma/vector/dictionary.file-0
-rw-r--r-- 1 yuta supergroup 17613 2012-05-03 09:32 /user/yuta/madmagi_out_ma/vector/frequency.file-0
drwxr-xr-x - yuta supergroup     0 2012-05-03 09:31 /user/yuta/madmagi_out_ma/vector/tf-vectors
drwxr-xr-x - yuta supergroup     0 2012-05-03 09:33 /user/yuta/madmagi_out_ma/vector/tfidf-vectors
drwxr-xr-x - yuta supergroup     0 2012-05-03 09:29 /user/yuta/madmagi_out_ma/vector/tokenized-documents
drwxr-xr-x - yuta supergroup     0 2012-05-03 09:30 /user/yuta/madmagi_out_ma/vector/wordcount
```

Let's inspect the generated SequenceFiles with seqdumper (or vectordump). You can see the tfidf vector values and the wordcount counts. In the wordcount output, distinctive words such as グリーフシード, ソウルジェム, ワルプルギス, and マミ stand out with high counts.
```
$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma/vector/tfidf-vectors/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 10:38:17 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma/vector/tfidf-vectors/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma/vector/tfidf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: /ma.txt: Value: {867:-6.465347766876221,866:-7.082433700561523,865:-9.58966064453125,864:-15.704275131225586,863:-16.735137939453125,862:-9.369179725646973,861:-6.465347766876221,860:-15.299805641174316,859:-9.80518627166748,858:-8.674174308776855,857:-6.780914306640625,856:-6.465347766876221,855:-8.674174308776855,854:-10.818595886230469,853:-7.918401718139648,

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma/vector/wordcount/ngrams/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 10:41:25 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma/vector/wordcount/ngrams/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma/vector/wordcount/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable
Key: アイツ: Value: 12.0
Key: アタシ: Value: 30.0
Key: アンタ: Value: 30.0
Key: エネルギー: Value: 12.0
Key: キュゥ: Value: 13.0
Key: グリーフシード: Value: 10.0
Key: ソウルジェム: Value: 17.0
Key: ダメ: Value: 15.0
Key: バカ: Value: 16.0
Key: ホント: Value: 10.0
Key: マミ: Value: 20.0
Key: ワルプルギス: Value: 10.0
```

The TF/IDF values in the seq2sparse output above are negative, which feels wrong, so let's adjust the seq2sparse options and run it again.
```
$ bin/mahout seq2sparse \
  --input madmagi_out_ma/seq \
  --output madmagi_out_ma_test/vector \
  --maxDFPercent 40 \
  --maxNGramSize 6 \
  --sequentialAccessVector \
  --namedVector

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 12:58:43 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: /ma.txt: Value: /ma.txt:{0:0.4339554011821747,1:0.4339554011821747,2:0.811856210231781,3:2.1479697227478027,4:5.765240669250488,5:0.6861437559127808,6:10.023337364196777,7:1.0177156925201416,8:6.868295669555664,9:2.1479697227478027,10:4.653653144836426,11:2.744575023651123,12:8.564437866210938,13:5.699537754058838,14:4.01262092590332,15:1.4716144800186157,16:4.572004318237305,17:1.3722875118255615,18:4.713962554931641,19:1.708484172821045,20:5.341354846954346,21:2.0584311485290527

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma_test/vector/wordcount/ngrams/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:06:39 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma_test/vector/wordcount/ngrams/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma_test/vector/wordcount/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable
Key: アイツ: Value: 12.0
Key: アタシ: Value: 30.0
Key: アンタ: Value: 30.0
Key: イレギュラー: Value: 2.0
Key: インキュベーター: Value: 3.0
Key: ウザ: Value: 3.0
Key: ウゼェ: Value: 2.0
Key: エネルギー: Value: 12.0
Key: エントロピー: Value: 2.0
Key: オイ: Value: 5.0
Key: オイッ: Value: 2.0
Key: カッコ: Value: 3.0
Key: キュゥ: Value: 13.0
Key: キュウ: Value: 2.0
Key: クラス: Value: 2.0
Key: グリーフシード: Value: 10.0
Key: コイツ: Value: 5.0
Key: ゴメン: Value: 2.0
Key: ゼロ: Value: 2.0
Key: ソウルジェム: Value: 17.0
Key: ゾンビ: Value: 2.0
Key: タツヤ: Value: 2.0
Key: ダメ: Value: 15.0
Key: チッ: Value: 2.0
Key: ッ: Value: 5.0
Key: テメェ: Value: 9.0
Key: ナメ: Value: 3.0
Key: ハッ: Value: 3.0
Key: バカ: Value: 16.0
Key: バランス: Value: 2.0
Key: バレ: Value: 2.0
Key: ベテラン: Value: 2.0
Key: ホント: Value: 10.0
Key: ママ: Value: 2.0
Key: マミ: Value: 20.0
Key: ミス: Value: 3.0
Key: モノ: Value: 2.0
Key: リハビリ: Value: 4.0
Key: リンゴ: Value: 2.0
Key: ルール: Value: 3.0
Key: ワルプルギス: Value: 10.0
```
Clustering
Below we run Canopy and K-means clustering, using the TF/IDF values as the vectors. We will also use the clusterdump command to inspect the clusters that are actually extracted. First, check the help for canopy and kmeans.
```
$ bin/mahout canopy -h
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>             comma separated archives to be unarchived on the compute machines.
 -conf <configuration file>    specify an application configuration file
 -D <property=value>           use value for given property
 -files <paths>                comma separated files to be copied to the map reduce cluster
 -fs <local|namenode:port>     specify a namenode
 -jt <local|jobtracker:port>   specify a job tracker
 -libjars <paths>              comma separated jar files to include in the classpath.
 -tokenCacheFile <tokensFile>  name of the file with the tokens
Job-Specific Options:
 --input (-i) input                                   Path to job input directory.
 --output (-o) output                                 The directory pathname for output.
 --distanceMeasure (-dm) distanceMeasure              The classname of the DistanceMeasure. Default is SquaredEuclidean
 --t1 (-t1) t1                                        T1 threshold value
 --t2 (-t2) t2                                        T2 threshold value
 --t3 (-t3) t3                                        T3 (Reducer T1) threshold value
 --t4 (-t4) t4                                        T4 (Reducer T2) threshold value
 --clusterFilter (-cf,-clusterFilter) clusterFilter   Cluster filter suppresses small canopies from mapper
 --overwrite (-ow)                                    If present, overwrite the output directory before running job
 --clustering (-cl)                                   If present, run clustering after the iterations have taken place
 --method (-xm) method                                The execution method to use: sequential or mapreduce. Default is mapreduce
 --help (-h)                                          Print out help
 --tempDir tempDir                                    Intermediate output directory
 --startPhase startPhase                              First phase to run
 --endPhase endPhase                                  Last phase to run

$ bin/mahout kmeans -h
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>             comma separated archives to be unarchived on the compute machines.
 -conf <configuration file>    specify an application configuration file
 -D <property=value>           use value for given property
 -files <paths>                comma separated files to be copied to the map reduce cluster
 -fs <local|namenode:port>     specify a namenode
 -jt <local|jobtracker:port>   specify a job tracker
 -libjars <paths>              comma separated jar files to include in the classpath.
 -tokenCacheFile <tokensFile>  name of the file with the tokens
Job-Specific Options:
 --input (-i) input                        Path to job input directory.
 --output (-o) output                      The directory pathname for output.
 --distanceMeasure (-dm) distanceMeasure   The classname of the DistanceMeasure. Default is SquaredEuclidean
 --clusters (-c) clusters                  The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first
 --numClusters (-k) k                      The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
 --convergenceDelta (-cd) convergenceDelta The convergence delta value. Default is 0.5
 --maxIter (-x) maxIter                    The maximum number of iterations.
 --overwrite (-ow)                         If present, overwrite the output directory before running job
 --clustering (-cl)                        If present, run clustering after the iterations have taken place
 --method (-xm) method                     The execution method to use: sequential or mapreduce. Default is mapreduce
 --help (-h)                               Print out help
 --tempDir tempDir                         Intermediate output directory
 --startPhase startPhase                   First phase to run
 --endPhase endPhase                       Last phase to run
```
Canopy

```
$ bin/mahout canopy \
  --input madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000 \
  --output madmagi_out_ma_test/canopy \
  --t1 0.9 \
  --t2 0.8 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:44:35 INFO common.AbstractJob: Command line arguments: {--distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --input=madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000, --method=mapreduce, --output=madmagi_out_ma_test/canopy, --startPhase=0, --t1=0.8, --t2=0.7, --tempDir=temp}
12/05/03 13:44:35 INFO canopy.CanopyDriver: Build Clusters Input: madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000 Out: madmagi_out_ma_test/canopy Measure: org.apache.mahout.common.distance.CosineDistanceMeasure@177d59d4 t1: 0.8 t2: 0.5
12/05/03 13:44:39 INFO input.FileInputFormat: Total input paths to process : 1
12/05/03 13:44:40 INFO mapred.JobClient: Running job: job_201205030840_0073
12/05/03 13:44:41 INFO mapred.JobClient: map 0% reduce 0%
12/05/03 13:44:54 INFO mapred.JobClient: map 100% reduce 0%
12/05/03 13:45:05 INFO mapred.JobClient: map 100% reduce 33%
12/05/03 13:45:07 INFO mapred.JobClient: map 100% reduce 100%
12/05/03 13:45:11 INFO mapred.JobClient: Job complete: job_201205030840_0073

$ bin/mahout seqdumper -s /user/yuta/madmagi_out_ma_test/canopy/clusters-0-final/part-r-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:10:46 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=/user/yuta/madmagi_out_ma_test/canopy/clusters-0-final/part-r-00000, --startPhase=0, --tempDir=temp}
Input Path: /user/yuta/madmagi_out_ma_test/canopy/clusters-0-final/part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.clustering.canopy.Canopy
Key: C-0: Value: C-0: {0:0.4339554011821747,1:0.4339554011821747,2:0.811856210231781,3:2.1479697227478027,4:5.765240669250488,5:0.6861437559127808,6:10.023337364196777,7:1.0177156925201416,8:6.868295669555664,9:2.1479697227478027,10:4.653653144836426,11:2.744575023651123,12:8.564437866210938,13:5.699537754058838,14:4.01262092590332,15:1.4716144800186157,16:4.572004318237305,17:1.3722875118255615,18:4.713962554931641,19:1.708484172821045,20:5.341354846954346,21:2.0584311485290527,

$ bin/mahout clusterdump \
  --seqFileDir madmagi_out_ma_test/canopy/clusters-0-final \
  --dictionary madmagi_out_ma_test/vector/dictionary.file-0 \
  --dictionaryType sequencefile \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  --numWords 100
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 13:47:14 INFO common.AbstractJob: Command line arguments: {--dictionary=madmagi_out_ma_test/vector/dictionary.file-0, --dictionaryType=sequencefile, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --numWords=100, --outputFormat=TEXT, --seqFileDir=madmagi_out_ma_test/canopy/clusters-0-final, --startPhase=0, --tempDir=temp}
C-0{n=1 c=[10:0.434, 100:0.434, 々:0.812, ぁ:2.148, あ:5.765, ぃ:0.686, い:10.023, ぅ:1.018, う:6.868, ぇ:2.148, え:4.654, お:2.745, か:8.564, が:5.700, き:4.013, ぎ:1.472, く:4.572, ぐ:1.372, け:4.714, げ:1.708, こ:5.341, ご:2.058, さ:5.814, ざ:0.921, し:7.183, じ:3.906, す:4.252, ず:2.081, せ:2.959, ぜ:0.752, そ:5.540, ぞ:0.970, た:8.090, だ:7.548, ち:5.226, っ:8.993, つ:3.319, づ:1.018, て:9.144, で:6.044, と:6.531, ど:4.976, な:10.237, に:7.196, ぬ:0.812, ね:4.852, の:9.206, は:7.084, ば:2.727, ぱ:0.868, ひ:1.302, び:1.106, ふ:1.302, ぶ:1.018, へ:0.921, べ:1.867, ほ:2.604, ぼ:0.686, ぽ:0.531, ま:5.252, (snip)
Top Terms:
 な => 10.237117767333984
 い => 10.023337364196777
 の => 9.205584526062012
 て => 9.14400863647461
 っ => 8.993457794189453
 ん => 8.597356796264648
 か => 8.564437866210938
 た => 8.089515686035156
 だ => 7.547581672668457
 に => 7.196336269378662
 し => 7.183239936828613
 は => 7.08424711227417
 う => 6.868295669555664
 も => 6.659483432769775
 る => 6.588409423828125
 と => 6.5309929847717285
 ら => 6.296092987060547
 っ て => 6.137056350708008
(snip)
 そ う => 2.5489108562469482
 っ と => 2.5489108562469482
 か ち => 2.5116984844207764
 て い => 2.5116984844207764
 魔 法 => 2.5116984844207764
 か ち ゃ => 2.4928841590881348
 か ち ゃ ん => 2.4928841590881348
 さ や か ち => 2.4928841590881348
 さ や か ち ゃ => 2.4928841590881348
 さ や か ち ゃ ん => 2.4928841590881348
```

K-Means
Now feed the clusters extracted by canopy into K-means. Even though --numClusters is set to 10, only one cluster gets created for some reason. (I am still investigating and will add the cause once I find it. One likely explanation: the whole corpus went into HDFS as the single file ma.txt, so seqdirectory produced exactly one document, and with only one vector, k-means can only ever form one cluster; see the sketch after the run below.)
```
$ bin/mahout kmeans \
  --input madmagi_out_ma_test/vector/tfidf-vectors/part-r-00000 \
  --output madmagi_out_ma_test/kmeans \
  --clusters madmagi_out_ma_test/canopy/clusters-0-final \
  --maxIter 40 \
  --numClusters 10 \
  --convergenceDelta 0.01 \
  --clustering \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure

$ bin/mahout clusterdump \
  --seqFileDir madmagi_out_ma_test/kmeans/clusters-1-final \
  --dictionary madmagi_out_ma_test/vector/dictionary.file-0 \
  --pointsDir madmagi_out_ma_test/kmeans/clusteredPoints \
  --dictionaryType sequencefile \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  --numWords 100
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
MAHOUT-JOB: /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/05/03 14:03:20 INFO common.AbstractJob: Command line arguments: {--dictionary=madmagi_out_ma_test/vector/dictionary.file-0, --dictionaryType=sequencefile, --distanceMeasure=org.apache.mahout.common.distance.CosineDistanceMeasure, --endPhase=2147483647, --numWords=100, --outputFormat=TEXT, --pointsDir=madmagi_out_ma_test/kmeans/clusteredPoints, --seqFileDir=madmagi_out_ma_test/kmeans/clusters-1-final, --startPhase=0, --tempDir=temp}
VL-0{n=1 c=[10:0.434, 100:0.434, 々:0.812, ぁ:2.148, あ:5.765, ぃ:0.686, い:10.023, ぅ:1.018, う:6.868, ぇ:2.148, え:4.654, お:2.745, か:8.564, が:5.700, き:4.013, ぎ:1.472, く:4.572, ぐ:1.372, け:4.714, げ:1.708, こ:5.341, ご:2.058, さ:5.814, ざ:0.921, し:7.183, じ:3.906, す:4.252, ず:2.081, せ:2.959, ぜ:0.752, そ:5.540, ぞ:0.970, た:8.090, だ:7.548, ち:5.226, っ:8.993, つ:3.319, づ:1.018, て:9.144, で:6.044, と:6.531, ど:4.976, な:10.237, に:7.196, ぬ:0.812, ね:4.852, の:9.206, は:7.084, ば:2.727, ぱ:0.868, ひ:1.302, び:1.106, ふ:1.302, ぶ:1.018, へ:0.921, べ:1.867, ほ:2.604, ぼ:0.686, ぽ:0.531, ま:5.252, (snip)
Top Terms:
 な => 10.237117767333984
 い => 10.023337364196777
 の => 9.205584526062012
 て => 9.14400863647461
 っ => 8.993457794189453
 ん => 8.597356796264648
 か => 8.564437866210938
 た => 8.089515686035156
 だ => 7.547581672668457
 に => 7.196336269378662
 し => 7.183239936828613
 は => 7.08424711227417
 う => 6.868295669555664
 も => 6.659483432769775
 る => 6.588409423828125
 と => 6.5309929847717285
 ら => 6.296092987060547
 っ て => 6.137056350708008
(snip)
 そ う => 2.5489108562469482
 っ と => 2.5489108562469482
 か ち => 2.5116984844207764
 て い => 2.5116984844207764
 魔 法 => 2.5116984844207764
 か ち ゃ => 2.4928841590881348
 か ち ゃ ん => 2.4928841590881348
 さ や か ち => 2.4928841590881348
 さ や か ち ゃ => 2.4928841590881348
 さ や か ち ゃ ん => 2.4928841590881348
Weight : [props - optional]: Point:
1.0 : [distance=0.0]: /ma.txt = [10:0.434, 100:0.434, 々:0.812, ぁ:2.148, あ:5.765, ぃ:0.686, い:10.023, ぅ:1.018, う:6.868, ぇ:2.148, え:4.654, お:2.745, か:8.564, が:5.700, き:4.013, ぎ:1.472, く:4.572, ぐ:1.372, け:4.714, げ:1.708, こ:5.341, ご:2.058, さ:5.814, ざ:0.921, し:7.183, じ:3.906, す:4.252, ず:2.081, せ:2.959, ぜ:0.752, そ:5.540, ぞ:0.970, た:8.090, だ:7.548, ち:5.226, っ:8.993, つ:3.319, づ:1.018, て:9.144, で:6.044, と:6.531, ど:4.976, な:10.237, に:7.196, ぬ:0.812, ね:4.852, の:9.206, は:7.084, ば:2.727, ぱ:0.868, ひ:1.302, び:1.106, ふ:1.302, ぶ:1.018, へ:0.921, べ:1.867, ほ:2.604,
```
Graph Display
Required JAR
Jar File Download examples (example source code) Organized by topic
Raw clustering output is hard to interpret visually, so let's generate a GUI graph. Graph generation needs a few JAR files, which can be found and downloaded from the site above. After some trial and error, the following JARs turned out to be necessary.

```
$ pwd
/home/yuta/work/src/mahout/mahout-distribution-0.6
$ wget http://www.java2s.com/Code/JarDownload/uncommons/uncommons-maths-1.2.jar.zip
$ wget http://www.java2s.com/Code/JarDownload/com.google/com.google.common_1.0.0.201004262004.jar.zip
$ wget http://www.java2s.com/Code/JarDownload/google-collections/google-collections-1.0.jar.zip
$ unzip '*.zip'   # quote the glob so unzip extracts each archive in turn
```

In addition, HADOOP_CLASSPATH has to be exported or you will get Java errors, so set it as follows. Putting this in .zshrc or similar is a good idea.
```
export JAVA_HOME=/usr/java/default/
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/lib/hadoop-0.20
export HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf/
export PATH=$HADOOP_HOME/bin:$PATH
export MAHOUT_HOME=/home/yuta/work/src/mahout/mahout-distribution-0.6
export HADOOP_CLASSPATH=$MAHOUT_HOME/mahout-math-0.6.jar:$MAHOUT_HOME/mahout-core-0.6.jar:$MAHOUT_HOME/commons-cli-2.0-mahout.jar:$MAHOUT_HOME/mahout-integration-0.6.jar:$MAHOUT_HOME/google-collections-1.0.jar:$MAHOUT_HOME/uncommons-maths-1.2.jar:$MAHOUT_HOME/com.google.common_1.0.0.201004262004.jar:$MAHOUT_HOME/lib/mahout-collections-1.0.jar
```

Sample Graph Image
Once the JAR files and environment variables are set up, running the following commands displays the sample k-means graph images.
```
$ hadoop jar /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6.jar \
  org.apache.mahout.clustering.display.DisplayClustering
$ hadoop jar /home/yuta/work/src/mahout/mahout-distribution-0.6/mahout-examples-0.6.jar \
  org.apache.mahout.clustering.display.DisplayKMeans
```

DisplayKMeans.java
Grab mahout-examples-0.6-sources.jar and look at how the sample clustering actually works. It appears to simply generate random sample data via org.apache.mahout.common.RandomUtils and the DisplayClustering class and display it. If this can be adapted to the Madmagi words, it should be possible to output a graph image of them. I will continue with that next time.
```
$ wget http://mirrors.ibiblio.org/maven2/org/apache/mahout/mahout-examples/0.6/mahout-examples-0.6-sources.jar
$ unzip mahout-examples-0.6-sources.jar
$ vi org/apache/mahout/clustering/display/DisplayKMeans.java
```

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.mahout.clustering.display;

import java.awt.Graphics;
import java.awt.Graphics2D;
import java.io.IOException;
import java.util.Collection;
import java.util.List;

import com.google.common.collect.Lists;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.ClusterClassifier;
import org.apache.mahout.clustering.ClusterIterator;
import org.apache.mahout.clustering.ClusteringPolicy;
import org.apache.mahout.clustering.KMeansClusteringPolicy;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.RandomUtils;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.Vector;

public class DisplayKMeans extends DisplayClustering {

  DisplayKMeans() {
    initialize();
    this.setTitle("k-Means Clusters (>" + (int) (significance * 100) + "% of population)");
  }

  public static void main(String[] args) throws Exception {
    DistanceMeasure measure = new ManhattanDistanceMeasure();
    Path samples = new Path("samples");
    Path output = new Path("output");
    Configuration conf = new Configuration();
    HadoopUtil.delete(conf, samples);
    HadoopUtil.delete(conf, output);

    RandomUtils.useTestSeed();
    DisplayClustering.generateSamples();
    writeSampleData(samples);
    boolean runClusterer = false;
    if (runClusterer) {
      int numClusters = 3;
      runSequentialKMeansClusterer(conf, samples, output, measure, numClusters);
    } else {
      int maxIterations = 10;
      runSequentialKMeansClassifier(conf, samples, output, measure, maxIterations);
    }
    new DisplayKMeans();
  }

  private static void runSequentialKMeansClassifier(Configuration conf, Path samples,
      Path output, DistanceMeasure measure, int numClusters) throws IOException {
    Collection<Vector> points = Lists.newArrayList();
    for (int i = 0; i < numClusters; i++) {
      points.add(SAMPLE_DATA.get(i).get());
    }
    List<Cluster> initialClusters = Lists.newArrayList();
    int id = 0;
    for (Vector point : points) {
      initialClusters.add(new org.apache.mahout.clustering.kmeans.Cluster(point, id++, measure));
    }
    ClusterClassifier prior = new ClusterClassifier(initialClusters);
    Path priorClassifier = new Path(output, "clusters-0");
    writeClassifier(prior, conf, priorClassifier);

    int maxIter = 10;
    ClusteringPolicy policy = new KMeansClusteringPolicy();
    new ClusterIterator(policy).iterateSeq(samples, priorClassifier, output, maxIter);
    for (int i = 1; i <= maxIter; i++) {
      ClusterClassifier posterior = readClassifier(conf, new Path(output, "classifier-" + i));
      CLUSTERS.add(posterior.getModels());
    }
  }

  private static void runSequentialKMeansClusterer(Configuration conf, Path samples,
      Path output, DistanceMeasure measure, int maxIterations)
      throws IOException, InterruptedException, ClassNotFoundException {
    Path clusters = RandomSeedGenerator.buildRandom(conf, samples,
        new Path(output, "clusters-0"), 3, measure);
    double distanceThreshold = 0.001;
    KMeansDriver.run(samples, clusters, output, measure, distanceThreshold,
        maxIterations, true, true);
    loadClusters(output);
  }

  // Override the paint() method
  @Override
  public void paint(Graphics g) {
    plotSampleData((Graphics2D) g);
    plotClusters((Graphics2D) g);
  }
}
```