Asakusa on Spark
Asakusa now runs on Spark.
Asakusa on Spark (Developer Preview) — Asakusa Framework Developer Preview 0.2.2 documentation
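It is already in real production use.
Nautilus Technologies deploys a high-speed large-scale data processing platform, developed with Asakusa Framework, at Sakura Internet to achieve accurate per-customer cost accounting; the platform is built on Apache Spark™ | NAUTILUS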
Since we are releasing it as OSS, I will summarize what it is and where it fits. As usual, all sorts of opinions come up inside Nautilus, but this time we are broadly in agreement.
- Performance
Broadly speaking, from the standpoint of business batch processing, our conclusion is that Spark finishes the work faster than Hadoop MapReduce across the board. In the use cases Nautilus has, performance improved in almost every case, with execution roughly 3 to 5 times faster. These figures are for cases where a single batch handles a few hundred GB or less and the processing is reasonably complex. For very simple processing (for example, a single simple aggregation) over roughly 500 GB or more at a time, I think MapReduce still comes out ahead, but even there Spark is starting to deliver respectable performance.
Why does Spark have the performance edge? It comes down, quite simply, to the fact that Hadoop MapReduce carries a large overhead and Spark removes it. The common story (or rather, the story on Spark's official site) is that the performance comes from in-memory processing, but in practice that is not really the case. The most I/O-expensive step, the forced write to disk when shuffle data is transferred between nodes, is the same as in Hadoop MapReduce and costs just as much. What Spark calls in-memory processing is, I think, the ability to use an explicit cache for iterative processing. In real workloads, though, batches built mainly around repeated processing of the same data are not that common, and being able to use an explicit cache for iteration does not make performance 10 times better in general. This is obvious if you think about it calmly, but it is the usual kind of stretch you hear when a company with a trendy technology is raising money from VCs, so I suppose it cannot be helped.
So what is the Hadoop MapReduce overhead in question? As I see it, there are two main parts. The first is that every operation has to be forced into the shape of Map and Reduce. This is often pointed out: when Hadoop MapReduce reshapes every operation into Map and Reduce and runs them in that order, waste is unavoidable. Of course, this is the flip side of a benefit, since the Map/Reduce shape makes parallel distributed execution easy, but once you start doing many different kinds of processing the drawback becomes conspicuous. The second is that each Map or Reduce task is, in reality, an independent JVM application. An application is started and stopped every time a Map or Reduce task runs, and even with the JVM reuse option this overhead is considerable.
These two points do not cost much for simple processing over large data, but in a DAG-based case with more than 1000 stages, this overhead can, if things go badly, account for 50% of the total cost. Spark removes both problems cleanly. It does not force processing into a sequence of Map and Reduce steps, and it manages job execution as a single ordinary application. As a result, simply switching from Hadoop MapReduce to Spark cuts the overhead cost.
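To make the arithmetic concrete, here is a minimal sketch of how a fixed per-stage cost adds up over many stages. Every number below is hypothetical and chosen only for illustration; none of them is a measurement from this post.

```python
def overhead_fraction(stages, overhead_per_stage_s, work_per_stage_s):
    """Share of total run time spent on fixed per-stage overhead."""
    overhead = stages * overhead_per_stage_s
    work = stages * work_per_stage_s
    return overhead / (overhead + work)


# Hypothetical: 1000 stages, each doing 20 s of useful work.
# If starting and stopping tasks also costs 20 s per stage,
# half of the whole run is overhead.
mapreduce_like = overhead_fraction(1000, 20.0, 20.0)

# Hypothetical: the same flow inside one long-lived application,
# where the fixed cost per stage drops to 1 s.
single_app = overhead_fraction(1000, 1.0, 20.0)

print(f"{mapreduce_like:.2f} {single_app:.3f}")
```

The point of the model is only that the overhead scales with the number of stages rather than with data size, which is why complex flows over moderate data suffer most.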
- The end of Hadoop MapReduce
Apart from the segment of very large-scale data processing, it is fair to say that today's Spark is almost one-sidedly better when compared with Hadoop MapReduce. I expect that most of the business processing now running on Hadoop MapReduce will move to Spark (where that is possible). When Hadoop first appeared, its main use was bulk aggregation of large data (especially web logs and life logs), but today the operations demanded on data are no longer a simple GroupBy; a wide variety of operations is increasingly required. In that situation, looking at performance, the cost of continuing to run on Hadoop MapReduce is far too high. Improvements to MapReduce are now only small ones, and given that processing takes 3 to 5 times longer than on Spark, continuing to invest in applications on Hadoop MapReduce is far too inefficient. Of course, for simple aggregation of massive data Hadoop MapReduce still has the advantage, so systems like that can simply keep running as they are, and there is no need to force them onto Spark.
In particular, for the kind of work we focus on, enterprise business batch processing, the benefit of using Hadoop MapReduce is close to zero. Judging by data size and processing complexity, Spark wins. If you cannot move from Hadoop MapReduce to Spark, does that not mean you wrote far too much processing in bare MapReduce? Since the first release of Asakusa four years ago, we have been saying loudly that writing bare MapReduce is not much different from writing your processing in assembler, that it will soon become unmaintainable, and that the drawbacks will keep growing. That now seems to be turning into reality. Hadoop MapReduce is becoming, more or less, a so-called "legacy" asset.
- Is Spark invincible?
So is Spark really that invincible? Obviously not. (It is more accurate to say that Hadoop was full of openings.)
・Configuration and tuning are a chore
As things stand, in most cases you will struggle to get good performance out of it. On top of that, running it stably takes experience. This looks like a trade between Hadoop MapReduce's advantage (configuration is easy for a distributed processing platform) and its disadvantage (you hit the performance ceiling fairly quickly). With Spark you have to set this and that parameter and go through trial and error to get performance, and you have to do it for each application. Furthermore, to run on YARN, the YARN configuration matters as well, so it really is a chore. You need settings that are consistent across a wide range of components, and that is quite difficult. In practice the settings tend to end up shared across applications anyway. The downside of being able to do more never goes to zero. That said, this is purely a matter of time.
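As a small illustration of what "consistent settings across Spark and YARN" can mean, the sketch below checks whether one executor container fits inside YARN's limits. The property names `spark.executor.memory`, `yarn.scheduler.maximum-allocation-mb` and `yarn.nodemanager.resource.memory-mb` are real configuration keys; the helper function, the separate overhead figure, and all the values are assumptions made up for this example, not recommended settings.

```python
def executor_fits_in_yarn(executor_memory_mb, memory_overhead_mb,
                          yarn_max_allocation_mb, nodemanager_memory_mb):
    """Return (fits, executors_per_node) for one executor container size."""
    # YARN has to grant the executor heap plus its off-heap allowance.
    container_mb = executor_memory_mb + memory_overhead_mb
    fits = container_mb <= yarn_max_allocation_mb
    per_node = nodemanager_memory_mb // container_mb if fits else 0
    return fits, per_node


# Hypothetical values: spark.executor.memory = 8192 MB plus 1024 MB of overhead.
# With yarn.scheduler.maximum-allocation-mb = 8192 the request is too large.
too_small = executor_fits_in_yarn(8192, 1024, 8192, 49152)

# Raising yarn.scheduler.maximum-allocation-mb to 16384 lets it through;
# a node with yarn.nodemanager.resource.memory-mb = 49152 then holds 5 executors.
adjusted = executor_fits_in_yarn(8192, 1024, 16384, 49152)

print(too_small, adjusted)
```

This covers only memory. Cores, executor counts and shuffle settings need the same kind of cross-checking, which is where the trial and error comes from.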
・Basic architectural issues
Opinions differ on this.
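The first is that each operation has only one logical output port. Tez, which has a very similar model, allows several. When you do flow control that branches the processing, this basically forces things. It was in fact a problem when we made Spark a target execution platform for Asakusa. Asakusa works around it by various means, but implementing it in the ordinary way is quite a stretch.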
The second is the forced write at shuffle time. Needless to say, this is extremely expensive. Still, there are arguments on both sides. If, late in a job, you reuse data from the first half, the write is useful; left alone, that data might simply keep occupying memory. And, needless to say, it is naturally robust against node failure. Given the thinking behind RDDs, writing out is at the root of the idea, so the answer may well be "if you object, do not use it". (In that case, I do wish they would not write "in-memory processing" so boldly on the top page. Quite a few people have in fact jumped to the wrong conclusion.)
That said, what gets used where can be estimated to a fair degree when the whole flow is compiled, so with a mechanism that can see the whole picture, writing out every shuffle is certainly wasteful. Writing only when necessary should be enough. Memory capacity may grow to the size of the solar system (figuratively speaking), so once a solid distributed processing management platform with a smarter planner appears (there is none yet), Spark will probably lose in no time. I think that is still some way off, but to be specific, when a fairly robust processing platform appears that is based on Rack Scale Architecture and assumes many-core hardware, I suspect it could pull ahead quite easily. Of course, in that case Hadoop would literally be treated as moving at an elephant's pace.
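One way to picture "writing only when necessary" is a compile-time pass over the flow that keeps an intermediate output only when more than one later stage reads it. The sketch below is a deliberately simplified illustration of that idea; the stage names and the fan-out rule are assumptions made for this example, not how Spark or Asakusa actually decide.

```python
from collections import Counter

# Hypothetical flow: (producer, consumer) edges between stages.
edges = [
    ("load", "cleanse"),
    ("cleanse", "aggregate"),
    ("cleanse", "late_join"),    # cleanse output is reused late in the flow
    ("aggregate", "late_join"),
    ("late_join", "export"),
]


def outputs_worth_materializing(edges):
    """Pick producers whose output is read by more than one consumer."""
    fan_out = Counter(producer for producer, _ in edges)
    return sorted(p for p, n in fan_out.items() if n > 1)


to_write = outputs_worth_materializing(edges)
print(to_write)
```

A real planner would also weigh data size, recomputation cost and failure recovery, but the whole flow being visible at compile time is what makes any such decision possible.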
- A major architectural change in Asakusa itself
This time, thanks in part to the hard work of the Asakusa development team, the architecture has changed substantially. While keeping the existing Asakusa concepts, the compiler has been almost entirely rewritten. Previously, Asakusa generated optimal MapReduce programs from Asakusa DSL. The new compiler first generates an intermediate DAG structure, and then generates bytecode optimized for Spark from that intermediate DAG data. In other words, when a faster DAG execution engine other than Spark appears, we can support the new execution environment just by swapping out the Spark bytecode generation part. It is well known that the DAG is becoming the standard execution form in distributed multi-node environments, and, as Tez and Flink already show, we expect more DAG execution engines to keep appearing. We have rebuilt the underlying architecture this time so that Asakusa can support such engines easily.
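The structure described above, a front end that produces an engine-neutral DAG and a replaceable back end per engine, can be sketched roughly as follows. This is only an illustration of the layering: the operator names, the `plan` and `compile_flow` functions and both back ends are hypothetical and have nothing to do with Asakusa's real compiler API. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical intermediate DAG: operator name -> set of upstream operators.
dag = {
    "read_sales": set(),
    "read_master": set(),
    "join": {"read_sales", "read_master"},
    "summarize": {"join"},
    "write": {"summarize"},
}


def plan(dag):
    """Engine-neutral step: order operators so inputs come before consumers."""
    return list(TopologicalSorter(dag).static_order())


def spark_backend(op):
    """Stand-in for code generation targeting Spark."""
    return f"spark:{op}"


def other_engine_backend(op):
    """Stand-in for code generation targeting some future DAG engine."""
    return f"other:{op}"


def compile_flow(dag, backend):
    """Only the backend argument changes when the target engine changes."""
    return [backend(op) for op in plan(dag)]


spark_plan = compile_flow(dag, spark_backend)
other_plan = compile_flow(dag, other_engine_backend)
print(spark_plan)
print(other_plan)
```

Both calls walk the same DAG in the same order, so the flow definition and its tests stay untouched while the generated output differs per engine.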
I think this changes what Asakusa means. If users write their processing in Asakusa DSL, then even when faster processing engines are developed in the future, they can enjoy the benefit as long as Asakusa supports them. Being able to move to a new, faster environment without changing the source code or test data at all is a very big advantage. Personally, I think being able to carry over the test data and test environment as they are is a big deal. This greatly improves the portability of investment in business applications. I believe Asakusa makes it possible to close the gap between the investment cycle of business systems and the life cycle of distributed environments.
- What comes next
Well, that is about it. For our part, we will use Spark as a "Better Hadoop", and at the same time as supporting Spark we changed the Asakusa architecture as planned, with an eye on what comes after it. Honestly, Spark is quite fast, more so than I expected.
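Starting with Spark, distributed environments will keep getting faster, and I think there will be more and more that we can do. The HDFS API looks like a safe bet as the base data layer, so the picture will probably be that you store data in something HDFS-compatible, swap the processing platform freely, and performance goes up on its own. That said, seen from the application layer, the gap between the life cycle of system investment and the life cycle of the execution platform will also keep widening, and I hope Asakusa turns out to be what fills it.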
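...and I just realized I had left this blog untouched for about a year, so I am feeling a bit sorry about that. I was up to my neck in a certain project...
From now on I would like to post about once every two months if I can. My apologies.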