Casually trying out Hadoop MapReduce with Hapyrus
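At today's in-house study session, id:a_bicky-sensei gave us a talk on Hadoop + MapReduce. It was interesting. While there I also learned about Hapyrus (https://www.hapyrus.com/), a service that lets you casually try out MapReduce jobs, so I signed up (it is apparently in beta at the moment) and built an application.
Signing up doesn't seem to involve anything particularly difficult, so I'll skip that part.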
After creating an app, you register three things: (1) the data source, which is either a text file (up to 4000 characters) or data on Amazon S3, and (2) a map script and a reduce script. Hadoop, somewhere off in the cloud, then runs the MapReduce job for you. Neat.
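The languages currently available are Perl, Ruby, and Python (I wonder if it's Hadoop Streaming behind the scenes?). This time I wrote mine in Perl. For the data source I used a text file of Hamlet, Act 1, Scene 1, up to the point where the ghost of the late king exits. As for the processing, I went with the most basic of basics: the proverbial word count.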
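The map and reduce steps look like the following. Judging from the Ruby sample, the scripts are apparently expected to consume whatever comes in from the command line, reading it all straight through from the top the way cat does, so I wrote mine to match.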
map script
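    use strict;
    use warnings;

    # Read every input line and emit "word<TAB>1" for each word found.
    while ( my $line = <> ) {
        chomp $line;
        # Split at word boundaries, then keep only tokens containing word characters.
        my @words = split /\b/, $line;
        print "$_\t1\n" for grep { /\w+/ } @words;
    }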
reduce script
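    use strict;
    use warnings;

    my %hash;

    # Each input line is "word<TAB>count" as emitted by the map script.
    while ( my $line = <> ) {
        chomp $line;
        my ( $word, $i ) = split /\t/, $line;
        # Add the emitted count instead of ignoring it (same result while map always emits 1).
        $hash{$word} += $i;
    }

    # Hash keys come back in no particular order, so the output is unsorted.
    for my $key ( keys %hash ) {
        print $key . "\t" . $hash{$key} . "\n";
    }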
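A lot of this is sloppy, but let's not worry about that. The data source is long, so I've omitted it. That said, "long" here still means 4000 characters or fewer, which realistically is a data size not worth processing with Hadoop MR, but this is only a trial, so I'm not going to worry about it.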
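For a quick check without uploading anything, the map, sort, and reduce flow can be imitated in a single Perl script. This is only a rough sketch: the helper names map_line and reduce_records and the two sample lines are made up for illustration (they are not the Hamlet data), and the in-memory sort merely stands in for the sorting a MapReduce framework does between the two phases. It says nothing about how Hapyrus itself runs the job.

```perl
use strict;
use warnings;

# Hypothetical helper: same tokenizing as the map script,
# but returns the "word<TAB>1" records instead of printing them.
sub map_line {
    my ($line) = @_;
    chomp $line;
    my @records;
    for my $token ( split /\b/, $line ) {
        push @records, "$token\t1" if $token =~ /\w/;
    }
    return @records;
}

# Hypothetical helper: sums the counts per word, like the reduce script.
sub reduce_records {
    my (@records) = @_;
    my %count;
    for my $record (@records) {
        my ( $word, $n ) = split /\t/, $record;
        $count{$word} += $n;
    }
    return %count;
}

# Made-up sample input standing in for the real data source.
my @input = ( "to be, or not", "to be" );

# Map phase: collect the records emitted for every line.
my @records;
push @records, map_line($_) for @input;

# Stand-in for the framework's sort between map and reduce.
@records = sort @records;

# Reduce phase, then print one "word<TAB>count" line per word.
my %count = reduce_records(@records);
print "$_\t$count{$_}\n" for sort keys %count;
```

Keeping the logic in small subroutines makes it easy to try other sample lines before registering the scripts.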
Result
https://gist.github.com/1074484
It looks like the words in Hamlet got counted reasonably well. The results aren't sorted, though, and I don't really understand how that part works. Is there some way to specify the key, I wonder?
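One likely reason for the unsorted output is on the Perl side: the reduce script prints by looping over the keys of a hash, and Perl returns hash keys in no particular order. A minimal sketch of ordering the lines before printing follows. The helper name sorted_lines and the sample counts are hypothetical, and this only orders what a single reduce script prints. Whether Hapyrus offers any setting for the key or the sort order is not something I know.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "word<TAB>count" lines,
# highest count first, ties broken alphabetically.
sub sorted_lines {
    my (%count) = @_;
    my @words = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
    my @lines;
    push @lines, "$_\t$count{$_}" for @words;
    return @lines;
}

# Made-up counts for illustration only.
my %sample = ( the => 3, ghost => 1, king => 2, watch => 2 );
print "$_\n" for sorted_lines(%sample);
```

In the reduce script, the same effect could be had by replacing the loop over keys %hash with a loop over the sorted word list.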
Hapyrus still seems to have plenty of rough edges, but it's a service I have high hopes for.
(Aside)
bicky-sensei had already written exactly the user script I wanted! http://d.hatena.ne.jp/a_bicky/20110610/1307713778