Streaming Processing for Big Data
Big data is all the rage these days, so I went looking for frameworks that can process large volumes of data as streams, in a semi-realtime fashion.
The requirements are:
- No single node where processing can stall
- As simple as possible
- Bindings for a variety of languages
and so on.
Candidate frameworks include Dempsy, Storm, Esper, Streambase, HStreaming, and Yahoo S4; I'll look into them one by one.
Dempsy
A relatively new one first: Dempsy. http://dempsy.github.com/Dempsy/
What is Dempsy?
In a nutshell, Dempsy is a framework that provides for the easy implementation of Stream-based, Real-time, BigData applications.
Dempsy is Nokia's "Distributed Elastic Message Processing System."
Dempsy is Distributed. That is to say a dempsy application can run on multiple JVMs on multiple physical machines.
Dempsy is Elastic. That is, it is relatively simple to scale an application to more (or fewer) nodes. This does not require code or configuration changes but allows the dynamic insertion and removal of processing nodes.
Dempsy is Message Processing. Dempsy fundamentally works by message passing. It moves messages between Message processors, which act on the messages to perform simple atomic operations such as enrichment, transformation, or other processing. Generally an application is intended to be broken down into more smaller simpler processors rather than fewer large complex processors.
Dempsy is a Framework. It is not an application container like a J2EE container, nor a simple library. Instead, like the Spring Framework it is a collection of patterns, the libraries to enable those patterns, and the interfaces one must implement to use those libraries to implement the patterns.
What is Dempsy? (translation)
In a nutshell, Dempsy is a framework that makes it easy to implement stream-based, real-time BigData applications. Dempsy is Nokia's "Distributed Elastic Message Processing System."
- Dempsy is distributed. That is, a Dempsy application can run on multiple JVMs across multiple physical machines.
- Dempsy is elastic. That is, it is relatively simple to scale an application to more (or fewer) nodes; processing nodes can be added and removed dynamically without code or configuration changes.
- Dempsy is message processing. That is, Dempsy fundamentally works by message passing: it moves messages between message processors, which perform simple atomic operations such as enrichment or transformation. In general, an application should be decomposed into many small, simple processors rather than a few large, complex ones.
- Dempsy is a framework. It is neither an application container like a J2EE container nor a simple library. Rather, like the Spring Framework, it is a collection of patterns, the libraries that enable those patterns, and the interfaces you implement in order to use those libraries to realize the patterns.
What Problem is Dempsy solving?
Dempsy is not designed to be a general purpose framework, but is intended to solve a certain class of problems while encouraging the use of the best software development practices.
Dempsy is meant to solve the problem of processing large amounts of "near real time" stream data with the lowest lag possible; problems where latency is more important than "guaranteed delivery." This class of problems includes use cases such as:
- Real time monitoring of large distributed systems
- Processing complete rich streams of social networking data
- Real time analytics on log information generated from widely distributed systems
- Statistical analytics on real-time vehicle traffic information on a global basis
It is meant to provide developers with a tool that allows them to solve these problems in a simple straightforward manner by allowing them to concentrate on the analytics themselves rather than the infrastructure. Dempsy heavily emphasizes "separation of concerns" through "dependency injection" and out of the box supports both Spring and Guice. It does all of this by supporting what can be (almost) described as a "distributed actor model."
In short, Dempsy is a framework to enable decomposing a large class of message processing applications into flows of messages to relatively simple processing units implemented as POJOs.
What problem is Dempsy solving? (translation)
Dempsy is not designed as a general-purpose framework; rather, it is designed to solve a certain class of problems while encouraging good software development practices.
Dempsy is designed to solve the problem of processing large volumes of "near real time" stream data with the lowest possible lag; in particular, problems where latency matters more than "guaranteed delivery." This class of problems includes use cases such as:
- Real-time monitoring of large distributed systems
- Processing complete, rich streams of social networking data
- Real-time analytics on log information generated by widely distributed systems
- Statistical analysis of real-time vehicle traffic information on a global basis
The point is that Dempsy gives developers a tool for solving these problems in a simple, straightforward manner, so they can concentrate on the analytics themselves rather than on the infrastructure.
Dempsy strongly emphasizes "separation of concerns" (関心の分離 - Wikipedia) through "dependency injection" (依存性の注入 - Wikipedia), and supports both Spring and Guice out of the box. It achieves this by supporting what can (almost) be described as a "distributed actor model." In short, Dempsy is a framework for decomposing a large class of message-processing applications into flows of messages handled by relatively simple processing units implemented as POJOs (Plain Old Java Object - Wikipedia).
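To make the "many small, simple POJO message processors" idea concrete, here is a minimal sketch in plain Java. It deliberately does not use Dempsy's real annotations or interfaces (which I have not dug into yet); the `WordEvent`, `WordCounter`, and `Dispatcher` names, the `key()` method, and the in-process routing are all my own stand-ins for the keyed, cross-JVM routing that Dempsy itself would handle.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: NOT Dempsy's actual API. The sketch shows the shape of the
// programming model the docs describe: small POJO processors addressed by a
// message key, with all routing handled outside the processor.
public class MessageProcessorSketch {

    // A message carries a key; the key decides which processor instance gets it.
    static final class WordEvent {
        final String word;
        WordEvent(String word) { this.word = word; }
        String key() { return word; }
    }

    // A small, simple processor that does one atomic thing (counting).
    static final class WordCounter {
        private long count = 0;
        void handle(WordEvent event) {
            count++;
            System.out.println(event.word + " -> " + count);
        }
    }

    // A toy in-process "container": one processor instance per key. In the real
    // framework this routing would span JVMs and physical machines.
    static final class Dispatcher {
        private final Map<String, WordCounter> processors = new HashMap<>();
        void dispatch(WordEvent event) {
            processors.computeIfAbsent(event.key(), k -> new WordCounter())
                      .handle(event);
        }
    }

    public static void main(String[] args) {
        Dispatcher dispatcher = new Dispatcher();
        for (String w : new String[] {"storm", "dempsy", "storm"}) {
            dispatcher.dispatch(new WordEvent(w));
        }
    }
}
```

The application logic lives entirely in `WordCounter.handle`; everything else is plumbing, which is exactly the part Dempsy promises to take off the developer's hands.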
Storm
And now for what is probably the favorite: Storm. Home · nathanmarz/storm Wiki · GitHub
http://storm-project.net/documentation.html
Rationale
The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable. Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be. There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.
However, realtime data processing at massive scale is becoming more and more of a requirement for businesses. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.
Storm fills that hole.
Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:
- Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.
- Brittle: There's little fault-tolerance. You're responsible for keeping each worker and queue up.
- Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead-simple to use and operate?
Storm satisfies these goals.
Rationale (translation)
The past decade has seen a revolution in data processing. MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales that were previously unthinkable. Unfortunately, these are not realtime systems, nor were they designed to be. There is no hack that will turn Hadoop into a realtime system; realtime data processing has fundamentally different requirements than batch processing.
However, realtime data processing at massive scale is increasingly becoming a business requirement. The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.
Storm fills that hole.
Before Storm, you would typically have to build a network of queues and workers yourself to do realtime processing. Workers would process messages off a queue, update databases, and send new messages to other queues for further processing. Unfortunately, this approach has serious limitations:
- Tedious: Most of your development time goes into configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic you actually care about ends up being a relatively small percentage of your codebase.
- Brittle: There is little fault tolerance. You are responsible for keeping every worker and queue up yourself.
- Painful to scale: When the message throughput for a single worker or queue gets too high, you have to partition how the data is spread around and reconfigure the other workers with the new destinations. This introduces moving parts and new pieces that can fail.
Although the queues-and-workers paradigm breaks down at large message volumes, message processing is clearly the fundamental paradigm for realtime computation. The question then becomes: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead simple to use and operate?
Storm meets these goals.
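For comparison with the hand-built queues-and-workers approach above, here is roughly what a minimal Storm topology looks like. This is a word-count sketch in the spirit of the storm-starter examples from the nathanmarz/storm era (backtype.storm packages, as I remember them); the spout and bolt classes are my own, so treat it as an illustration of the spout/bolt/grouping model rather than verified, copy-paste-ready code.

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {

    // Spout: the source of the stream (here just the same sentence, forever).
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("the quick brown fox"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each sentence into words.
    public static class SplitSentence extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: counts words. fieldsGrouping on "word" sends the same word to the
    // same task, so a plain in-memory map is enough for this sketch.
    public static class WordCount extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<String, Long>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Long count = counts.get(word);
            count = (count == null) ? 1L : count + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitSentence(), 4).shuffleGrouping("sentences");
        builder.setBolt("count", new WordCount(), 4).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();   // in-process cluster for local testing
        cluster.submitTopology("word-count", conf, builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}
```

The spout is the stream source, the bolts are the processing steps, and the groupings (shuffleGrouping, fieldsGrouping) declare how tuples are routed between them; Storm deploys and wires the actual workers, which is precisely the tedious part of the hand-built setup.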
Why Storm is important
Storm exposes a set of primitives for doing realtime computation. Like how MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.
The key properties of Storm are:
Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm's small set of primitives satisfy a stunning number of use cases.
Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm's usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.
Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
Programming language agnostic: Robust and scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
Why is Storm important? (translation)
Storm exposes a set of primitives for doing realtime computation. Just as MapReduce makes writing parallel batch processing dramatically easier, Storm's primitives make writing parallel realtime computation dramatically easier.
Storm's key properties are as follows:
- Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing). It can also run continuous queries over data streams and stream the results to clients (continuous computation), and parallelize intensive queries such as search queries on the fly (distributed RPC). Storm's small set of primitives covers a surprising number of use cases.
- Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the topology's parallelism settings (see the sketch after this list). As an example of Storm's scale, one of its initial applications processed 1,000,000 messages per second on a 10-node cluster, including hundreds of database calls per second as part of the topology. Storm's use of Zookeeper for cluster coordination lets it scale to much larger cluster sizes.
- Guarantees no data loss: A realtime system must have strong guarantees that data is successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, which stands in direct contrast to systems like S4.
- Extremely robust: Unlike systems such as Hadoop, which are notorious for being difficult to manage, Storm clusters just work. Making the experience of managing a Storm cluster as painless as possible is an explicit goal of the Storm project.
- Fault-tolerant: If faults occur while your computation is running, Storm reassigns tasks as necessary. Storm makes sure a computation can run forever (or until you kill it).
- Programming language agnostic: Robust, scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
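As a rough illustration of the "add machines and increase the parallelism settings" claim from the Scalable bullet above, here is how the scaling knobs look in code. It reuses the spout and bolt classes from the earlier word-count sketch and the same assumed backtype.storm API; the numbers are arbitrary and only show that scaling out is a matter of parallelism hints and worker counts, not of rewriting the processing logic.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// Scaling sketch reusing the spout/bolt classes from the earlier word-count
// example. The only scaling knobs are the parallelism hint per component and
// the number of worker JVMs; the processing logic itself is untouched.
public class ScaledWordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new WordCountTopology.SentenceSpout(), 2);
        builder.setBolt("split", new WordCountTopology.SplitSentence(), 16)
               .shuffleGrouping("sentences");
        builder.setBolt("count", new WordCountTopology.WordCount(), 16)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(8);   // spread the executors over 8 worker JVMs in the cluster
        StormSubmitter.submitTopology("scaled-word-count", conf, builder.createTopology());
    }
}
```

If I remember correctly there is also a `storm rebalance` command for adjusting the parallelism of a topology that is already running, but I have not tried it yet.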
To be continued... someday, maybe.