JP version - Beyond Shuffling - Tips and Tricks for Scaling Apache Spark (Holden Karau)
The Japanese version of "Beyond Shuffling - Tips and Tricks for Scaling Apache Spark"
About you
RDD reuse (caching, persistence levels, and checkpointing)
Working with key/value data
Why using group key is dangerous and what to do about it
Best practices for Spark accumulators*
Why Spark SQL is great
A discussion of future enhancements to improve Spark MLlib performance
Amazon Kinesis: Real-time Streaming Big data Processing Applications (BDT311)...Amazon Web Services
"This presentation will introduce Kinesis, the new AWS service for real-time streaming big data ingestion and processing.
We’ll provide an overview of the key scenarios and business use cases suitable for real-time processing, and how Kinesis can help customers shift from a traditional batch-oriented processing of data to a continual real-time processing model. We’ll explore the key concepts, attributes, APIs and features of the service, and discuss building a Kinesis-enabled application for real-time processing. We’ll walk through a candidate use case in detail, starting from creating an appropriate Kinesis stream for the use case, configuring data producers to push data into Kinesis, and creating the application that reads from Kinesis and performs real-time processing. This talk will also include key lessons learnt, architectural tips and design considerations in working with Kinesis and building real-time processing applications."
9. Spark vs. Disk I/O
• Caching
• Caching datasets
• Caching computation results
➡ Less disk I/O
• RDD (Resilient Distributed Datasets)
• The cache is held in distributed form across the cluster nodes
➡ Recoverable even if part of it is lost
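The effect of caching on disk I/O can be illustrated with a small toy model. This is a plain-Python sketch, not Spark code: `ToyDataset`, its `cache()` and `read()` methods, and the `disk_reads` counter are hypothetical names introduced here, and a single in-memory list stands in for a cache that Spark would spread across cluster nodes.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset stored on disk (not a Spark API)."""

    def __init__(self, values):
        self._values = list(values)
        self._cached = None       # in-memory copy, filled on first read after cache()
        self._use_cache = False
        self.disk_reads = 0       # number of simulated disk reads

    def cache(self):
        # Only marks the dataset for caching; nothing is read until read() is called.
        self._use_cache = True
        return self

    def _load_from_disk(self):
        # Simulated disk read: count it and return a fresh copy of the data.
        self.disk_reads += 1
        return list(self._values)

    def read(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._load_from_disk()
            return self._cached
        return self._load_from_disk()


def iterate(dataset, iterations):
    """A simple iterative job: sum the whole dataset once per iteration."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.read())
    return total


uncached = ToyDataset(range(5))
cached = ToyDataset(range(5)).cache()

uncached_total = iterate(uncached, 10)
cached_total = iterate(cached, 10)

print("disk reads without cache:", uncached.disk_reads)
print("disk reads with cache:   ", cached.disk_reads)
```

Both runs compute the same total, but the uncached dataset is read from the simulated disk once per iteration, while the cached one is read only once.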
10. Spark vs. task launching time
• According to the paper: "we used a fast event-driven RPC library"
11. Spark vs. task launching time
• According to the paper: "we used a fast event-driven RPC library"
• 5-10 sec ➡ 5 ms
• Ref. "Shark: SQL and Rich Analytics at Scale" https://www.icsi.berkeley.edu/pubs/networking/ICSI_sharksql12.pdf
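To see why the launch-time figures matter, consider the total launch overhead when tasks are launched in many sequential rounds. The sketch below uses the 5-10 sec and 5 ms figures from the slide; the round count of 100 and the function name `launch_overhead_seconds` are assumptions chosen purely for illustration.

```python
def launch_overhead_seconds(launches, seconds_per_launch):
    """Total time spent only on launching tasks, ignoring the work itself."""
    return launches * seconds_per_launch


ROUNDS = 100  # assumed number of sequential launch rounds (illustrative only)

slow_low = launch_overhead_seconds(ROUNDS, 5.0)     # 5 sec per launch
slow_high = launch_overhead_seconds(ROUNDS, 10.0)   # 10 sec per launch
fast = launch_overhead_seconds(ROUNDS, 0.005)       # 5 ms per launch

print(f"5-10 sec per launch: {slow_low:.0f}-{slow_high:.0f} s of overhead")
print(f"5 ms per launch:     {fast:.1f} s of overhead")
```

Under these assumptions, launch overhead alone drops from 500-1000 seconds to about half a second, so per-launch latency dominates jobs made of many short rounds.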
12. Result: faster iterative processing
Ref. "Spark: A framework for iterative and interactive cluster computing" http://laser.inf.ethz.ch/2013/material/joseph/LASER-Joseph-6.pdf