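Distributed recommendation with Mahout (2)
Right, let's set up a Hadoop pseudo-distributed environment and get Mahout running.
Setting up Hadoop and getting Mahout
First, set up the Hadoop we are going to use. This is not the main topic, so here are just the key points.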
- For Hadoop, use v0.20.2 rather than the latest version.
- Download hadoop-0.20.2.tar.gz from Apache Download Mirrors.
- For the various settings, basically follow Hishidama's Hadoop pseudo-distributed Memo (Hadoop擬似分散環境メモ).
- After configuring and before starting up, don't forget hadoop namenode -format. I forgot this the first time and got stuck.
- Start Hadoop with start-all.sh, then confirm that you can reach HDFS with hadoop fs -ls.
- By the way, to shut Hadoop down, use stop-all.sh.
- For Mahout, download mahout-distribution-0.5.tar.gz from Apache Download Mirrors.
- Extract it and put mahout-core-0.5-job.jar somewhere convenient.
Preparing the input data files
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0
First, the ratings data file (prefs.txt). It is the same file we used last time. Each record is user ID, item ID, rating: (long,long,float).
As explained in the "for Hadoop" section last time, there is not much point in running MapReduce over data this small. Please treat what follows strictly as an example. Once I have tested a large-scale case and it works, I would like to write that up separately.
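Since every record is expected to look like userID,itemID,rating, a quick sanity check before uploading can be sketched as below. check_prefs is a hypothetical helper name, not part of Hadoop or Mahout, and the patterns assume non-negative integer IDs and plain decimal ratings.

```shell
# Hypothetical helper (not part of Hadoop or Mahout): verifies that every
# line of a ratings file has the form userID,itemID,rating.
# Assumes non-negative integer IDs and plain decimal ratings.
check_prefs() {
  awk -F, '
    { sub(/\r$/, "") }   # tolerate Windows line endings
    NF != 3 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+(\.[0-9]+)?$/ {
      printf "line %d is malformed: %s\n", NR, $0
      bad = 1
    }
    END { exit bad }     # exit status 1 if any line was rejected
  ' "$1"
}
```

Calling check_prefs /path/to/prefs.txt prints each offending line and returns a non-zero status if anything looks wrong, so it can gate the upload step.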
1
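Next, the file that tells Mahout who to compute recommendations for (users.txt). It lists user IDs, one per line. The file itself is optional (if you leave it out, recommendations are computed for everyone), but this time let's specify it anyway.

One thing to watch out for here: do not put a newline at the end of this file. In the example above, the job crashes unless the file contains nothing but the single character 1. Everyone, please take care not to suffer the tragedy of waiting nearly 30 hours only for the job to crash, with the cause being a NumberFormatException. Honestly, it was a huge shock. So much of a shock that I tracked down the cause and sent in a patch, which seems to have been merged right away, so I expect this to be fixed in the next version, v0.6.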
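One way to be sure about that trailing newline is to generate users.txt with printf instead of an editor and then check the last byte. This is only a sketch: write_users and ends_with_newline are hypothetical helper names, and write_users assumes at least one user ID is given.

```shell
# Hypothetical helpers for building users.txt safely.

# write_users FILE ID [ID...]
# Writes the IDs one per line, with NO newline after the last one.
write_users() {
  out=$1; shift
  printf '%s' "$1" > "$out"; shift
  for id in "$@"; do
    printf '\n%s' "$id" >> "$out"
  done
}

# Succeeds if FILE is non-empty and its last byte is a newline.
# (Command substitution strips trailing newlines, so a lone "\n"
# becomes the empty string.)
ends_with_newline() {
  [ -s "$1" ] && [ -z "$(tail -c 1 "$1")" ]
}
```

For example, write_users users.txt 1 produces a file containing only the character 1, and ends_with_newline users.txt can be run on any existing file before it goes to HDFS.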
So, to avoid ending up like this: even if it is "for Hadoop", don't go straight to a huge dataset on your first trial run. Try it with small data. It is a trial, not production use, after all.
Right, pulling ourselves together, let's push the files into HDFS.
$ hadoop fs -put /path/to/users.txt input/users.txt
$ hadoop fs -put /path/to/prefs.txt input/prefs.txt
Check that they arrived.
$ hadoop fs -ls input
Found 2 items
-rw-r--r-- 1 daisuke supergroup 231 2011-06-03 14:17 /user/daisuke/input/prefs.txt
-rw-r--r-- 1 daisuke supergroup 2 2011-06-04 23:10 /user/daisuke/input/users.txt
Finally, launching Mahout
$ hadoop jar \
  /path/to/mahout-core-0.5-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.output.dir=output \
  -Dmapred.input.dir=input/prefs.txt \
  --usersFile input/users.txt \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION
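- Path to the downloaded Mahout jar file (on the local file system)
- FQCN of the Driver that performs the recommendation
- Output path for the results (on HDFS)
- Path to the input file described above (on HDFS)
- Path to the users file described above (on HDFS)
- Which measure to use as the "similarity" when building the similarity matrix (here, the Pearson correlation coefficient)*1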
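To make it easier to rerun the job with a different similarity measure, the command can be wrapped in a small function. The sketch below only prints the command (a dry run) rather than executing it. recommender_cmd and the MAHOUT_JAR variable are hypothetical names, and the input paths are the ones used in this post.

```shell
# Hypothetical wrapper: prints the RecommenderJob command line for a given
# similarity measure and output directory. It does not run anything.
MAHOUT_JAR=${MAHOUT_JAR:-/path/to/mahout-core-0.5-job.jar}

recommender_cmd() {
  similarity=$1   # e.g. SIMILARITY_PEARSON_CORRELATION
  output_dir=$2   # HDFS output directory; Hadoop refuses to reuse an existing one
  printf '%s\n' "hadoop jar $MAHOUT_JAR org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.output.dir=$output_dir -Dmapred.input.dir=input/prefs.txt --usersFile input/users.txt --similarityClassname $similarity"
}

# Dry run: show what would be executed.
recommender_cmd SIMILARITY_PEARSON_CORRELATION output
```

Once the printed command looks right, it can be pasted into the shell or piped to sh. Giving each run its own output directory keeps the results of different similarity measures side by side.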
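Now sit up straight and wait for about 5 minutes. The waiting time presumably depends on your machine specs. If you get back to the prompt without anything that looks like an error, the job is done. Let's look at the result.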
$ hadoop fs -ls output
Found 2 items
drwxr-xr-x - daisuke supergroup 0 2011-06-04 23:19 /user/daisuke/output/_logs
-rw-r--r-- 1 daisuke supergroup 18 2011-06-04 23:20 /user/daisuke/output/part-r-00000
The file called part-r-00000 is the result. Let's look inside.
$ hadoop fs -cat output/part-r-00000
1 [104:3.9098573]
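In other words, the recommendation for user 1 is item 104, and the prediction is that they would rate it about 3.9. So, the same as before.
As we saw last time, the similarity matrix is full of NaN values, so it seems recommendations for the other items could not be computed. If you change the similarityClassname option to switch how the similarity matrix is calculated*2, you can also get predicted ratings for multiple items.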
1 [105:3.875,104:3.7222223,106:3.6]
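The result lines above pack everything for one user into a single line: the user ID, followed by a bracketed list of item:score pairs. If you want one row per recommendation, a small awk filter along these lines can unpack it. parse_recs is a hypothetical helper name, and the sketch assumes the user ID and the list are separated by whitespace.

```shell
# Hypothetical helper: turns "USER [ITEM:SCORE,ITEM:SCORE,...]" lines
# read from stdin into tab-separated "USER ITEM SCORE" rows.
parse_recs() {
  awk '{
    user = $1
    list = $2
    gsub(/^\[|\]$/, "", list)        # drop the surrounding brackets
    n = split(list, pairs, ",")      # one entry per recommended item
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, ":")       # kv[1] = item ID, kv[2] = score
      printf "%s\t%s\t%s\n", user, kv[1], kv[2]
    }
  }'
}
```

It reads from standard input, so it can sit at the end of a pipeline such as hadoop fs -cat output/part-r-00000 | parse_recs.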
*1: It looks like SIMILARITY_COOCCURRENCE, SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_UNCENTERED_COSINE, SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE, and SIMILARITY_CITY_BLOCK can also be used. I have not tried them all yet.
*2: Here I tried SIMILARITY_COOCCURRENCE. That said, given the nature of prefs.txt, co-occurrence is not really a good fit.