A look at what Apache Spark is (Part 1)
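Hello.
I'm still in the middle of trying out Kafka, so the timing is a little awkward, but lately I have been collecting information on "Apache Spark" to see whether it could be useful.
Like MapReduce, it is a platform for distributed parallel processing, but it is reported to be tens of times faster than MapReduce.
My first reaction was that this couldn't be right, but the mechanism it uses internally, called RDD, looked interesting.
So I decided to read through some of the slides and papers.
The first one I looked at is "Overview of Spark" (http://spark.incubator.apache.org/talks/overview.pdf).
Here is a summary of what I read.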
What is Spark?
A fast, interactive, language-integrated cluster computing platform.
What are the goals of the Spark project?
- Extend MapReduce to better fit the following two analytics use cases.
1. Iterative algorithms (machine learning, graph processing)
2. Interactive data mining
- Make it easier to program.
1. Integration into Scala
2. Interactive, more dynamic use from the Scala interpreter
Motivation for developing Spark
Most cluster computing models in use today are acyclic: they read data from storage and write results back to storage.
(Because the data lives in storage, the runtime can decide where to run each task and can recover from failures automatically.)
However, this model is inefficient for workloads like the following, which reuse the same dataset repeatedly.
The reason is that current frameworks reload the data from storage on every query.
1. Iterative algorithms (machine learning, graph processing)
2. Interactive data mining (R, Excel, Python, etc.)
The answer to this problem: Resilient Distributed Datasets (RDDs)
RDDs are a mechanism that lets data that is used repeatedly be kept in memory.
They retain the following attractive properties of MapReduce.
・Fault tolerance, data locality, scalability
On top of that, they support a wide range of applications.
Spark's programming model
Resilient Distributed Datasets (RDDs) have the following properties.
・Immutable, partitioned collections of objects
・Created by applying parallel transformations (map, filter, groupBy, join) to data in storage
・Can be cached in memory for reuse
Actions that can be run on RDDs
・count, reduce, collect, save, ...
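To get a feel for the difference between transformations and actions, here is a minimal sketch in plain Scala. It does not use the Spark API: a lazy collection view stands in for an RDD, and the names `TransformationsVsActions`, `buildPipeline` and `mapCalls` are made up for this illustration. The point is only that building the pipeline does no work, and that work happens when a result is actually requested.

```scala
object TransformationsVsActions {
  // Counts how often the map function really runs (illustration only).
  var mapCalls = 0

  // "Transformations": map and filter on a lazy view only describe the work.
  // The view is a stand-in for an RDD here, not a real one.
  def buildPipeline(data: Seq[Int]) =
    data.view
      .map { x => mapCalls += 1; x * 2 }
      .filter(_ % 3 == 0)

  def main(args: Array[String]): Unit = {
    val pipeline = buildPipeline(1 to 6)
    println(s"map calls after building the pipeline: $mapCalls") // still 0

    // "Action": asking for the elements forces the computation.
    val result = pipeline.toVector
    println(s"result: $result, map calls: $mapCalls")
  }
}
```

Without a cache, asking the view for its elements a second time runs the map function again, which is the recomputation that keeping data in memory is meant to avoid.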
- Example: log mining
Load the error messages from a log file into memory, then search them interactively for various patterns.
Running the following kind of processing over 1 TB of data takes 170 seconds when the data is read from storage, but only 5 to 7 seconds with RDDs.
val lines = spark.textFile("hdfs://...")            // load the log file (base RDD)
val errors = lines.filter(_.startsWith("ERROR"))    // keep only lines that start with ERROR (transformed RDD)
val messages = errors.map(_.split("\t")(2))         // split on tabs and take the third field
val cachedMsgs = messages.cache()                   // cache the extracted messages in memory
cachedMsgs.filter(_.contains("foo")).count          // count cached messages containing "foo"
cachedMsgs.filter(_.contains("bar")).count          // count cached messages containing "bar"
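The same steps can be tried without a cluster. The following is a small local sketch using ordinary Scala collections instead of Spark, so it only mirrors the shape of the processing, not its distributed execution. The sample log lines and the helper `countContaining` are hypothetical, and `toVector` merely plays the role that `cache()` plays in the Spark version.

```scala
object LogMiningSketch {
  // Hypothetical sample data: level, date and message separated by tabs.
  val lines: Seq[String] = Seq(
    "ERROR\t2013-01-01\tfoo failed",
    "INFO\t2013-01-01\tall good",
    "ERROR\t2013-01-02\tbar timed out",
    "ERROR\t2013-01-03\tfoo retry failed"
  )

  // Filter the ERROR lines, take the third tab-separated field, and
  // materialize the result once so later queries reuse it.
  val cachedMsgs: Vector[String] =
    lines.filter(_.startsWith("ERROR")).map(_.split("\t")(2)).toVector

  // Each query runs only against the in-memory messages.
  def countContaining(word: String): Int =
    cachedMsgs.count(_.contains(word))

  def main(args: Array[String]): Unit = {
    println(countContaining("foo")) // 2
    println(countContaining("bar")) // 1
  }
}
```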
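Fault tolerance of RDDs
An RDD keeps its lineage, the record of how it was derived, so that lost data can be rebuilt.
Because it records which file the data came from and how it was produced, the data can be reconstructed after a failure.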
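The idea of lineage can be sketched with a toy model in plain Scala. This is not how Spark implements it; every type below (`Dataset`, `Source`, `Mapped`, `Filtered`, `Cached`) is hypothetical and exists only to show that a dataset which remembers its parent and the function applied to it can be rebuilt after its in-memory copy is lost.

```scala
object LineageSketch {
  // A dataset knows how to compute itself and can describe how it was derived.
  sealed trait Dataset[A] {
    def compute(): Vector[A]
    def lineage: String
  }

  final case class Source[A](name: String, load: () => Vector[A]) extends Dataset[A] {
    def compute(): Vector[A] = load()
    def lineage: String = name
  }

  final case class Filtered[A](parent: Dataset[A], label: String, p: A => Boolean) extends Dataset[A] {
    def compute(): Vector[A] = parent.compute().filter(p)
    def lineage: String = s"${parent.lineage} -> filter($label)"
  }

  final case class Mapped[A, B](parent: Dataset[A], label: String, f: A => B) extends Dataset[B] {
    def compute(): Vector[B] = parent.compute().map(f)
    def lineage: String = s"${parent.lineage} -> map($label)"
  }

  // An in-memory copy that can be lost and is then rebuilt from the parent.
  final class Cached[A](parent: Dataset[A]) extends Dataset[A] {
    private var data: Option[Vector[A]] = None
    def isInMemory: Boolean = data.isDefined
    def lose(): Unit = data = None // simulate losing the cached copy
    def compute(): Vector[A] = data.getOrElse {
      val rebuilt = parent.compute()
      data = Some(rebuilt)
      rebuilt
    }
    def lineage: String = s"${parent.lineage} -> cache"
  }

  // Builds the log-mining chain on hypothetical sample data.
  def build(): Cached[String] = {
    val src = Source("log", () => Vector("ERROR\ta failed", "INFO\tb ok", "ERROR\tc failed"))
    val errors = Filtered(src, "ERROR", (s: String) => s.startsWith("ERROR"))
    new Cached(Mapped(errors, "message", (s: String) => s.split("\t")(1)))
  }

  def main(args: Array[String]): Unit = {
    val msgs = build()
    println(msgs.lineage)
    println(msgs.compute())
    msgs.lose()             // the cached data disappears
    println(msgs.compute()) // rebuilt by replaying the lineage
  }
}
```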
- Example: logistic regression
Goal: find a line that separates two groups of data points well.
The code for running this on Spark is as follows.
val data = spark.textFile(...).map(readPoint).cache()   // load the data and cache it in memory
var w = Vector.random(D)                                 // start from a random vector
for (i <- 1 to ITERATIONS) {                             // repeat the gradient step ITERATIONS times
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
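The slide code relies on helpers that are not shown (`readPoint`, `Vector`, `D`, `ITERATIONS`). As a rough local sketch of the same update rule, the version below uses plain arrays and an in-memory `Seq` instead of an RDD. The `Point` type, the `dot`, `train` and `classify` helpers, and the seed parameter are all assumptions made for this example. As in the original, the step size is implicitly 1.

```scala
object LogisticRegressionSketch {
  // A labelled point; y is expected to be +1.0 or -1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  // Sum over all points of (1 / (1 + exp(-y * (w . x))) - 1) * y * x,
  // the same expression as in the Spark code.
  def gradient(data: Seq[Point], w: Array[Double]): Array[Double] =
    data.map { p =>
      val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)
    }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

  // Start from a random vector and apply w -= gradient repeatedly.
  def train(data: Seq[Point], iterations: Int, seed: Long): Array[Double] = {
    val rng = new scala.util.Random(seed)
    var w = Array.fill(data.head.x.length)(rng.nextDouble())
    for (_ <- 1 to iterations) {
      val g = gradient(data, w)
      w = w.zip(g).map { case (wi, gi) => wi - gi }
    }
    w
  }

  // Which side of the separating line a point falls on.
  def classify(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0

  def main(args: Array[String]): Unit = {
    // Hypothetical, linearly separable sample data.
    val data = Seq(
      Point(Array(2.0, 1.0), 1.0), Point(Array(3.0, 2.0), 1.0),
      Point(Array(-2.0, -1.0), -1.0), Point(Array(-3.0, -1.5), -1.0)
    )
    val w = train(data, iterations = 10, seed = 42L)
    println("Final w: " + w.mkString("(", ", ", ")"))
  }
}
```

Every iteration reads `data` again, which is why keeping it in memory pays off as the number of iterations grows.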
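The slides compare running times as the number of iterations varies.
Compared with Hadoop, Spark's first iteration, which includes loading the data, is expensive, but later iterations run much faster, so workloads that keep iterating until the regression converges become very fast overall.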
Example Spark applications
・In-memory data mining on Hive data (Conviva)
・Predictive analytics (Quantifind)
・City traffic prediction (Mobile Millennium)
・Twitter spam classification (Monarch)
・Collaborative filtering via matrix factorization
Example Spark application: Conviva GeoReport
It achieves a 40x speedup when aggregating over many keys, thanks to the following.
・Avoids re-reading filtered-out records and unused columns
・Avoids repeated decompression
・Keeps deserialized objects in memory
Frameworks built on Spark
- Pregel on Spark (Bagel)
・Google's message-passing model for graph computation
・Written in 200 lines of code
- Hive on Spark (Shark)
・Keeps compatibility with Apache Hive
・Written in 3000 lines of code
Implementation
Spark runs on top of Apache Mesos, a resource management platform.
On Mesos it runs side by side with Hadoop and other applications.
It can also read HDFS and other Hadoop input sources.
No changes were made to the Scala compiler.
Spark scheduler
Processes jobs as Dryad-like acyclic graphs
Builds a pipeline within each stage and runs it
Reuses work and schedules for locality so that the cache is used effectively
Takes partitioning into account to avoid shuffles
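As an illustration of what pipelining within a stage means, the sketch below chains a map and a filter over one partition using a plain Scala iterator. This is an analogy rather than Spark's scheduler code, and `PipeliningSketch`, `runPartition` and `trace` are hypothetical names. Each element passes through both functions before the next element is touched, so no intermediate collection is built between the two steps.

```scala
object PipeliningSketch {
  // Records the order in which the functions are applied (illustration only).
  val trace = scala.collection.mutable.ArrayBuffer.empty[String]

  // map and filter are fused: the iterator pulls one element at a time
  // through both functions.
  def runPartition(partition: Seq[Int]): Vector[Int] =
    partition.iterator
      .map { x => trace += s"map($x)"; x + 1 }
      .filter { x => trace += s"filter($x)"; x % 2 == 0 }
      .toVector

  def main(args: Array[String]): Unit = {
    println(runPartition(Seq(1, 2)))
    println(trace.mkString(", ")) // map and filter calls are interleaved
  }
}
```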
Interactive Spark
With the following changes to the Scala interpreter, Spark can be used interactively from the command line.
・Modified so that the dependencies of loaded classes can be resolved
・Modified so that generated classes are distributed over the network
Behavior when there is not enough memory
Even when memory is insufficient, Spark still gets a speedup in proportion to the amount of data it can cache.
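One way to picture this is a cache that holds only as many partitions as fit and recomputes the rest on every pass. The following toy model in plain Scala works that way. It is an assumption-laden sketch, not Spark's actual memory management, and `PartiallyCached`, `maxCached`, `readPartition` and `computations` are names invented here.

```scala
object PartialCacheSketch {
  // partitions: functions that (re)compute each partition from its source.
  // maxCached: how many partitions fit in memory in this toy model.
  final class PartiallyCached[A](partitions: Vector[() => Vector[A]], maxCached: Int) {
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[A]]
    var computations = 0 // how many times a partition had to be computed

    def readPartition(i: Int): Vector[A] = cache.get(i) match {
      case Some(data) => data // served from memory
      case None =>
        computations += 1
        val data = partitions(i)()
        if (cache.size < maxCached) cache(i) = data // keep it only if it fits
        data
    }

    def collect(): Vector[A] =
      partitions.indices.toVector.flatMap(i => readPartition(i))
  }

  def main(args: Array[String]): Unit = {
    val parts = Vector.tabulate(4)(i => () => Vector(i))
    val ds = new PartiallyCached[Int](parts, maxCached = 2)
    ds.collect()
    ds.collect()
    // The second pass recomputes only the partitions that did not fit.
    println(ds.computations)
  }
}
```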
Recovery time after a failure
The iteration in which the failure occurs takes longer, but it recovers in about 1.5 times the duration of a successful iteration.
Other iterations are almost unaffected.
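...and that is roughly what the slides covered.
In short, Spark can keep as much of the loaded data in memory as it needs while processing, which is what shortens the processing time.
At the same time, RDDs held in memory also seem to recover from failures without trouble.
How RDDs behave looks like the key point, so before actually running Spark it will probably be necessary to understand their structure and mechanics.
Also, because Spark runs on top of Apache Mesos, there is one more layer than with Hadoop to install and set up before anything can run.
That part looks like it could be a lot of work when building an environment.
For now, I have at least a rough overview of why it is fast and what its main features are.
I plan to keep reading the material bit by bit, summarizing it, and preparing to actually use it.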