Distributed Recommendation with Mahout (1)
Well, it's been a little while.
Up through last time I stepped away from recommendation for a bit and showed you the world of clustering, but that didn't go over very well, so I'm heading back to recommendation.
Amid all this, what makes Mahout the top pick is its focus on ensuring scalability.
Machine learning, naturally, produces its results through computation, and the more underlying data there is, the more plausible those results become. However, the more data there is, the more the amount of computation tends to grow, exponentially.
Apache Mahoutで機械学習してみるべ ("Let's Try Machine Learning with Apache Mahout") - 都元ダイスケ IT-PRESS
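That was the introduction I used to kick off the recommendation posts, but the truth is that the algorithm I showed can't be run as a distributed job. It can't, and that's that. It simply isn't written in the MapReduce paradigm.
So I would have been happy if the processing I introduced the other day could be converted straight into MapReduce, but unfortunately it seems that algorithm can't be converted as-is into the MapReduce paradigm "in a way that benefits from distributed processing"*1. Well, whether it's impossible or just absurdly hard I don't really know, but either way Mahout has no implementation of it.
In other words, what we need is "a recommendation algorithm that can be written in the MapReduce paradigm and benefits from distributed processing." So go ahead and forget the algorithm from the other day. On to a new one.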
For Hadoop
One thing before I introduce the new algorithm. Hadoop is a platform that uses the MapReduce paradigm to get scalability in a clever way, but when you don't need that scalability, that is, when the data is small and a conventional non-distributed recommendation algorithm finishes in a realistic amount of time, Hadoop actually takes longer.
Non-distributed recommendation can run in real time*2 as long as the data is small. The distributed recommendation algorithm I'm about to introduce, however, needs at least about five minutes of processing time no matter how small the data is. In other words, it can only run as a batch job*3.
Keep the trade-off in mind: with batch processing you can no longer "follow the latest rating information in real time."
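To make the batch-versus-real-time trade-off concrete, here is a minimal sketch. Everything in it (the in-memory `ratings` and `precomputed` stores, the stand-in `recommend_for` function) is hypothetical and has nothing to do with Mahout's actual API; it only shows that a request handler reading precomputed results cannot see a rating that arrived after the last batch run.

```python
# Hypothetical in-memory stores; a real system would use a database or key-value store.
ratings = {1: {101: 5.0, 102: 3.0, 103: 2.5}}
precomputed = {}  # user id -> list of recommended item ids


def recommend_for(user_ratings, catalog):
    """Stand-in recommender: suggest catalog items the user has not rated yet."""
    return [item for item in catalog if item not in user_ratings]


def run_batch(catalog):
    """Batch job: recompute recommendations for every user, independent of requests."""
    for user, user_ratings in ratings.items():
        precomputed[user] = recommend_for(user_ratings, catalog)


def handle_request(user):
    """Request handler: only looks up what the last batch produced."""
    return precomputed.get(user, [])


catalog = [101, 102, 103, 104, 105, 106, 107]
run_batch(catalog)
before = handle_request(1)

ratings[1][104] = 4.0      # a new rating arrives after the batch
stale = handle_request(1)  # still the old result
run_batch(catalog)
fresh = handle_request(1)  # the new rating shows up only after the next batch
```

Here `stale` still contains item 104 even though the user has just rated it; only `fresh`, read after the next batch, drops it.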
Similarity matrix
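Do you remember the "matrices" you learned about in math class?*4
You know, those rectangular grids of numbers*5. There's a lot you can do with them, but all we use here is multiplication with a vector (the inner product).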
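Now think back for a moment: in non-distributed recommendation we ran into the concept of "similarity," a value expressing how much two users resemble each other. This similarity can be applied not only to users but also to items*6. In the earlier example we used the Pearson correlation coefficient, a value in the range -1 to 1, as the similarity.
I computed the Pearson correlation coefficient for every pair of items (101 to 107) and put the results in a table. There are a lot of cells where there is too little data to compute a value (NaN), though.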
Item | 101 | 102 | 103 | 104 | 105 | 106 | 107 |
---|---|---|---|---|---|---|---|
101 | 1.00 | 0.94 | -0.80 | 0.77 | -1.00 | NaN | NaN |
102 | 0.94 | 1.00 | -0.98 | 0.99 | NaN | NaN | NaN |
103 | -0.80 | -0.98 | 1.00 | -0.86 | NaN | NaN | NaN |
104 | 0.77 | 0.99 | -0.86 | 1.00 | NaN | NaN | NaN |
105 | -1.00 | NaN | NaN | NaN | 1.00 | NaN | NaN |
106 | NaN | NaN | NaN | NaN | NaN | 1.00 | NaN |
107 | NaN | NaN | NaN | NaN | NaN | NaN | 1.00 |
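Also, the similarity of an item with itself, such as 101 vs 101 or 107 vs 107, is 1.00 because it is a perfect match. Naturally. But values like these are so obvious that including them in the calculation is completely meaningless, so we set all of them to NaN*7.
We then treat the contents of this table as a matrix*8. Apparently this kind of thing is called a similarity matrix.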
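As a rough sketch of how such an item-to-item table could be computed by hand: the code below calculates the Pearson correlation for each item pair over the users who rated both items, returns NaN when the value is undefined (fewer than two common users, or no variance), and sets the diagonal to NaN as described above. The sample `ratings` are an assumption for illustration, since this post does not list the underlying data, and none of this is Mahout's own implementation.

```python
import math

# Hypothetical ratings (user -> {item: rating}), assumed for illustration only.
ratings = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 2.5, 104: 4.0, 105: 4.5, 107: 5.0},
    4: {101: 5.0, 103: 3.0, 104: 4.5, 106: 4.0},
    5: {101: 4.0, 102: 3.0, 103: 2.0, 104: 4.0, 105: 3.5, 106: 4.0},
}
items = [101, 102, 103, 104, 105, 106, 107]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN when it is undefined."""
    n = len(xs)
    if n < 2:
        return math.nan
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan  # no variance, so the correlation cannot be computed
    return sxy / math.sqrt(sxx * syy)


def item_similarity(a, b):
    """Correlate two items over the users who rated both; the diagonal is set to NaN."""
    if a == b:
        return math.nan
    common = [u for u in ratings if a in ratings[u] and b in ratings[u]]
    return pearson([ratings[u][a] for u in common], [ratings[u][b] for u in common])


# Rows and columns follow the order of `items`.
similarity = [[item_similarity(a, b) for b in items] for a in items]
```

With these assumed ratings, the 101/102 entry comes out at roughly 0.94, and pairs such as 101/107, which share only one user, come out as NaN.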
Incidentally, this time I built the similarity matrix from the Pearson correlation coefficient, but depending on the nature of the data, various other "similarities" are used as well, such as co-occurrence or Euclidean distance.
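For comparison, co-occurrence is about the simplest of these measures: for each pair of items, count how many users have both. The sketch below assumes a hypothetical mapping from users to the items they rated (the rating values themselves are ignored); it illustrates the idea and is not Mahout's implementation.

```python
from itertools import combinations

# Hypothetical data: which items each user has rated (values are not needed here).
rated_items = {
    1: [101, 102, 103],
    2: [101, 102, 103, 104],
    3: [101, 104, 105, 107],
    4: [101, 103, 104, 106],
    5: [101, 102, 103, 104, 105, 106],
}
items = [101, 102, 103, 104, 105, 106, 107]
index = {item: i for i, item in enumerate(items)}

# cooccurrence[i][j] = number of users who rated both items[i] and items[j]
cooccurrence = [[0] * len(items) for _ in items]
for user_items in rated_items.values():
    for a, b in combinations(sorted(user_items), 2):
        cooccurrence[index[a]][index[b]] += 1
        cooccurrence[index[b]][index[a]] += 1
```

Unlike the Pearson version, every cell here is a plain count, so there are no NaN entries to worry about.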
Sooo, next let's look at a user's rating table. Here I'd like to make a recommendation for user 1. User 1's rating data looks like this.
Item | User 1 |
---|---|
101 | 5.0 |
102 | 3.0 |
103 | 2.5 |
104 | - |
105 | - |
106 | - |
107 | - |
If we treat the unrated entries as 0 and turn this into a matrix as well, we get the column (5.0, 3.0, 2.5, 0, 0, 0, 0). This is the so-called user vector.
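In code, building the user vector from the table above is just a matter of walking the items in a fixed order and filling in 0 where there is no rating. The variable names here are made up for illustration.

```python
items = [101, 102, 103, 104, 105, 106, 107]

# User 1's ratings from the table; items 104 to 107 are unrated.
user1_ratings = {101: 5.0, 102: 3.0, 103: 2.5}

# Unrated items become 0.0, in the same item order as the similarity matrix.
user_vector = [user1_ratings.get(item, 0.0) for item in items]
```

The one thing to get right is that the item order matches the row and column order of the similarity matrix.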
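And then we take the inner product of the similarity matrix and the user vector, that is, the inner product of each row of the matrix with the vector.
Yes, you're getting sick of this by now, aren't you? So am I.
Anyway, in the matrix we get this way, the entries a through f are the "recommendation strength"*9 for each of the items 101 to 107. The idea is to recommend the items for which this value is large.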
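Here is a small sketch of that multiplication using the similarity table and user 1's ratings from this post, with the diagonal set to NaN as described earlier. One caveat: a plain dot product with NaN in it would turn every score into NaN, so this sketch skips NaN entries. That treatment is an assumption made here for illustration, not something confirmed about how Mahout handles it.

```python
import math

NAN = math.nan
items = [101, 102, 103, 104, 105, 106, 107]

# Similarity table from the post, with the diagonal replaced by NaN.
similarity = [
    [NAN, 0.94, -0.80, 0.77, -1.00, NAN, NAN],
    [0.94, NAN, -0.98, 0.99, NAN, NAN, NAN],
    [-0.80, -0.98, NAN, -0.86, NAN, NAN, NAN],
    [0.77, 0.99, -0.86, NAN, NAN, NAN, NAN],
    [-1.00, NAN, NAN, NAN, NAN, NAN, NAN],
    [NAN] * 7,
    [NAN] * 7,
]
user_vector = [5.0, 3.0, 2.5, 0.0, 0.0, 0.0, 0.0]


def score(row, vector):
    """Inner product of one similarity row with the user vector, skipping NaN (assumption)."""
    return sum(s * r for s, r in zip(row, vector) if not math.isnan(s))


scores = {item: score(row, user_vector) for item, row in zip(items, similarity)}

# Only items the user has not rated yet are candidates for recommendation.
unrated = [item for item, r in zip(items, user_vector) if r == 0.0]
ranked = sorted(unrated, key=lambda item: scores[item], reverse=True)
```

Under that assumption, item 104 gets the highest score among the unrated items (0.77 × 5.0 + 0.99 × 3.0 − 0.86 × 2.5 ≈ 4.67), so it would be the first recommendation for user 1.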
Summary
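To make recommendation work as a distributed algorithm, you end up wading into the world of math like this. I don't really know why either, but apparently the algorithm above can be expressed in the MapReduce paradigm.
You've probably had your fill of theory by now. Next time I'll actually use Mahout to do distributed recommendation.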
By the way
Apache Mahout v0.5 has been released. Congratulations!
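*1: That said, it is possible to force the earlier algorithm onto Hadoop "in a form that hardly benefits from distributed processing." There's no point in doing so, though.
*2: That is, in a web app, a user's request triggers the computation and the result is returned in the response.
*3: The computation is done ahead of time in a batch that is independent of requests, and at request time only the computed result is looked up.
*4: I never imagined the day would come when I'd have to rely on Hatena's TeX notation...
*5: I never liked them much. All you do is multiply and add in every direction, but while working through the formulas I'd lose track of where I was looking...
*6: You can also compute how much two items resemble each other.
*7: Whether leaving them as something other than NaN is harmful, or whether keeping 1.00 would be fine, I don't really know. I did it because that's what Mahout does internally, lol.
*8: Wow, it really is full of NaN. I'm starting to worry whether the calculation will make it to the end, lol.
*9: Note that at this stage it is not a "predicted rating" like the one in the non-distributed case.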