Trying Distributed Processing in PHP with Hadoop Streaming
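The collective intelligence programming I cover in the "I've lost count of how many people have done this already, but I tried Programming Collective Intelligence in PHP" series tends to be computation-heavy. And unless the algorithm is designed with care, it also tends to run out of memory.
In fact, in the previous article I ran straight into that insurmountable wall. Wondering what to do about it, I looked a little into "MapReduce", which I have been curious about lately and which is also used in Google's back end.
As for "MapReduce", I know little beyond what I read in "Googleを支える技術", but the article written by id:naoya was very easy to follow, so here is a link to it.
→ MapReduce - naoyaのはてなダイアリー
While I'm at it, here is the Amazon link for "Googleを支える技術" as well.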
Googleを支える技術 巨大システムの内側の世界 (WEB+DB PRESSプラスシリーズ)
- Author: 西田圭介
- Publisher: 技術評論社
- Release date: 2008/03/28
- Format: Book (softcover)
First of all, try it out
I knew that Hadoop exists as an open-source implementation of MapReduce, and I had read the articles id:naoya wrote in WEB+DB PRESS vol.47 & vol.48, but you never really understand something until you touch it. So I decided to actually try it.
Following the CodeZine article, I installed Hadoop and ran the word-count sample…
I did little more than trace the article step by step, so I'll skip the results here… If all you need is to set up the environment, it was extremely easy.
But Java… I can't really write it. (´・ω・`)
Hadoop is implemented in Java, so Map and Reduce functions are normally written in Java… But I'm a PHPer, so of course I want to write them in PHP. Which finally brings us to the main topic.
Hadoop has an extension called "HadoopStreaming" that lets you write Map and Reduce functions in any language by way of standard input and output.
For the details, I'll once again lean on id:naoya's article…
→ Hadoop Streaming - naoyaのはてなダイアリー
In short, any program that reads from STDIN, massages the keys and values, and writes them back out in the expected format can do MapReduce, whatever language it is written in.
So I wrote the Map/Reduce functions in PHP
The task is, as usual, counting words separated by whitespace. First, the Map function.
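#!/usr/bin/php
<?php
// Mapper: emit "word<TAB>1" for every whitespace-separated word read from STDIN.
while (($line = fgets(STDIN)) !== false) {
    foreach (preg_split('/\s+/', trim($line)) as $word) {
        if ($word !== '') {
            // "{$word}" replaces the "${word}" form, which is deprecated as of PHP 8.2.
            echo "{$word}\t1\n";
        }
    }
}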
Next, the Reduce function.
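#!/usr/bin/php
<?php
// Reducer: sum the values of each key read from STDIN as "key<TAB>value" lines.
$count = array();
while (($line = fgets(STDIN)) !== false) {
    $line = rtrim($line, "\r\n");
    if ($line === '') {
        continue;
    }
    list($key, $value) = array_pad(explode("\t", $line, 2), 2, '0');
    // Initialise on first sight (avoids an undefined-index notice) and add the
    // emitted value rather than assuming it is always 1.
    $count[$key] = (isset($count[$key]) ? $count[$key] : 0) + (int) $value;
}
foreach ($count as $key => $value) {
    echo "{$key}\t{$value}\n";
}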
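Before submitting a job, it can help to check the data flow on a single machine. The sketch below emulates what Hadoop Streaming does between the two scripts: the mapper's output is sorted by key and then piped into the reducer. The function names `map_step`, `reduce_step` and `local_wordcount` are made up for this illustration, and the awk bodies are only stand-ins so that the sketch runs without PHP. Replacing them with `php map.php` and `php reduce.php` should exercise the actual scripts, assuming both files sit in the current directory.

```shell
#!/bin/sh
# Local emulation of the streaming data flow: map | sort | reduce.
# All function names here are hypothetical, chosen for this sketch only.

# Stand-in mapper: one "word<TAB>1" line per whitespace-separated word.
# To test the PHP mapper instead, replace the body with: php map.php
map_step() {
  awk '{ for (i = 1; i <= NF; i++) printf "%s\t1\n", $i }'
}

# Stand-in reducer: sum the values for each key.
# To test the PHP reducer instead, replace the body with: php reduce.php
reduce_step() {
  awk -F '\t' '{ count[$1] += $2 } END { for (k in count) printf "%s\t%d\n", k, count[k] }'
}

# The first sort plays the role of Hadoop's shuffle/sort phase;
# the final sort only makes the printed order deterministic.
local_wordcount() {
  map_step | LC_ALL=C sort | reduce_step | LC_ALL=C sort
}

printf 'hoge fuga hoge\nfuga hoge\n' | local_wordcount
# prints:
# fuga	2
# hoge	3
```

A real job splits the input across several mappers and may run several reducers, so this only checks the stdin/stdout contract, not performance or cluster behaviour.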
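Now, distribute!
With the preparation done, let's actually run it.
For the sample document I could have used the customary "hoge" and "fuga", but since this is a distributed environment after all, I decided to use something with a somewhat larger amount of text.
So I took the text of "吾輩は猫である" (I Am a Cat) from 青空文庫 (Aozora Bunko), segmented it into words with MeCab beforehand, and used the result as the input document.*1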
$ bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar \
    -input inputs/wagahaiwa_nekodearu_wakati.txt -output outputs \
    -mapper 'php /home/hadoop/bin/mapreduce/php/wordcount/map.php' \
    -reducer 'php /home/hadoop/bin/mapreduce/php/wordcount/reduce.php'
I'll omit the output printed during the run, but judging from the timestamps the job took 38 seconds…
Well, the source file is a small one of about 1.4 MB, so Hadoop's overhead dominates and there is no way to see any gain in running time…*2
By the way, I got stuck on this at first: the script paths passed to the mapper and reducer apparently have to be written as full paths. Alternatively, the "-file {filename}" option transfers the file to each node before processing, and in that case it seems the mapper/reducer only needs the bare file name.
Now let's check that we actually got results. Hadoop stores the job output on the distributed file system, so I first write it out to the regular file system and then look at the contents.*3
$ dfsget outputs/part-00000 ~/outputs/wagahaiwa_nekodearu_wordcount.txt
$ sort -nrk 2 ~/outputs/wagahaiwa_nekodearu_wordcount.txt | head -20
の    9333
?     9214
?     9213
?     7487
?     6808
に    6805
て    6706
は    6570
?     6067
と    5629
?     5530
?     4118
で    3867
?     3614
?     3249
?     2648
?     2609
?     2456
な?   2413
??    2102
Splendidly, the "てにをは" particles take up the top ranks (lol). This time I used plain word segmentation as the input, but extracting only the nouns might give more interesting results.
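As a rough idea of what noun-only extraction could look like, here is a minimal sketch. It assumes MeCab's default output format with the IPA dictionary, where each line is the surface form, a tab, and a comma-separated feature list whose first field is the part of speech (名詞 for nouns), and where each sentence ends with an `EOS` line. The function name `nouns_only` is hypothetical, and the sample lines are hand-written to imitate that format rather than captured from a real run.

```shell
#!/bin/sh
# Keep only the surface forms whose first feature field is 名詞 (noun).
# Assumes lines shaped like "surface<TAB>pos,subpos,..."; "EOS" lines have no tab and are dropped.
nouns_only() {
  awk -F '\t' '$2 ~ /^名詞,/ { print $1 }'
}

# Hand-written sample in the assumed format (not real MeCab output).
printf '吾輩\t名詞,代名詞,一般\nは\t助詞,係助詞\n猫\t名詞,一般\nEOS\n' | nouns_only
# prints:
# 吾輩
# 猫
```

With MeCab installed, something like `mecab input.txt | nouns_only` would presumably produce one noun per line, which the mapper above can consume as-is because it splits on any whitespace.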
But there is actually a pitfall…
I experimented in a cheap environment this time, but the great thing is that the same approach would apply as-is even if the input file were vastly larger.
…So HadoopStreaming is really wonderful, but it has one pitfall, which id:naoya also points out: with HadoopStreaming, the input to the Reducer is not structured. I'd like to go into this a bit more next time.
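To make the point a little more concrete: with the Java API the reduce function receives a key together with an iterator over its values, whereas a streaming reducer only sees a flat stream of "key<TAB>value" lines sorted by key, so it has to notice by itself where one key ends and the next begins. The sketch below shows that boundary detection. The function name `reduce_sorted` is made up for this illustration and awk is used as a stand-in so the sketch runs on its own; the same logic could be written in PHP.

```shell
#!/bin/sh
# Streaming-style reducer that relies on the input being sorted by key.
# It keeps only the current key and its running sum, and flushes them
# whenever the key changes, instead of holding every key in memory.
reduce_sorted() {
  awk -F '\t' '
    NR > 1 && $1 != prev { printf "%s\t%d\n", prev, sum; sum = 0 }  # key boundary: flush
    { prev = $1; sum += $2 }                                        # accumulate current key
    END { if (NR > 0) printf "%s\t%d\n", prev, sum }                # flush the last key
  '
}

# What a reducer would see after the sort phase: same keys are adjacent, but not grouped.
printf 'fuga\t1\nfuga\t1\nhoge\t1\nhoge\t1\nhoge\t1\n' | reduce_sorted
# prints:
# fuga	2
# hoge	3
```

The trade-off is that this version is only correct when equal keys arrive on adjacent lines, which the sort phase guarantees within a single reducer.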
So, this will go on for just a little longer…