The Berkeley Data Analytics Stack (BDAS) aims to address emerging challenges in data analysis through a set of systems, including Spark, Shark and Mesos, that enable faster and more powerful analytics. In this talk, we’ll cover two recent additions to BDAS:
* Spark Streaming is an extension of Spark that enables high-speed, fault-tolerant stream processing through a high-level API. It uses a new processing model called “discretized streams” to enable fault-tolerant stateful processing with exactly-once semantics, without the costly transactions required by existing systems. This lets applications process much higher rates of data per node. It also makes programming streaming applications easier by providing a set of high-level operators on streams (e.g. maps, filters, and windows) in Java and Scala.
* Shark is a Spark-based data warehouse system compatible with Hive. It can answer HiveQL queries up to 100 times faster than Hive without modification to existing data or queries. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions. It employs a number of novel and traditional database optimization techniques, including column-oriented storage and mid-query replanning, to efficiently execute SQL on top of Spark. The system is in early use at companies including Yahoo! and Conviva.
1. What's New in the Berkeley Data Analytics Stack
Tathagata Das, Reynold Xin (AMPLab, UC Berkeley)
Hadoop Summit 2013
4. Project History
2010: Spark (core execution engine) open sourced
2012: Shark open sourced
Feb 2013: Spark Streaming alpha open sourced
Jun 2013: Spark entered Apache Incubator
12. Spark
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
»In-memory computing primitives
»General computation graphs
Improves usability through:
»Rich APIs in Scala, Java, Python
»Interactive shell
Up to 100× faster (2-10× on disk); often 5× less code
13. Why a New Framework?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
»More complex, multi-pass analytics (e.g. ML, graph)
»More interactive ad-hoc queries
»More real-time stream processing
14. Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
»Distributed collections of objects
»Can optionally be cached in memory across the cluster
»Manipulated through parallel operators
»Automatically recomputed on failure
Programming interface
»Functional APIs in Scala, Java, Python
»Interactive use from the Scala and Python shells
15. Example: Log Mining
Exposes RDDs through a functional API in Java, Python, Scala

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
errors.persist()
errors.filter(_.contains("foo")).count()      // Action
errors.filter(_.contains("bar")).count()

[Diagram: the master ships tasks to workers holding HDFS blocks 1-3; each worker caches its partition of the errors RDD (Errors 1-3) in memory and sends results back]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); 1 TB of data in 5 sec (vs 170 sec for on-disk data)
18. Spark in Java and Python
Python API:
lines = spark.textFile(…)
errors = lines.filter(lambda s: "ERROR" in s)
errors.count()

Java API:
JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count();
19. Projects Building on Spark
[Stack diagram: Mesos/YARN (resource manager) at the bottom, HDFS / Hadoop storage above it, Spark as the core engine, and Shark (SQL), Spark Streaming, GraphX, and MLBase layered on top]
20. GraphX
Combining data-parallel and graph-parallel computation
»Run graph analytics and ETL in the same engine
»Consume graph computation output in Spark
»Interactive shell
Programmability
»Support GraphLab / Pregel APIs in 20 LOC
»Implement PageRank in 5 LOC (sketched below)
Coming this summer as a Spark module
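GraphX was still unreleased at talk time, so here is a minimal sketch of the 5-LOC PageRank claim using the GraphX API as it later shipped in Spark 1.x; the edge-list file path is a hypothetical placeholder:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader

val sc = new SparkContext("local[*]", "PageRankSketch")
// Hypothetical edge list file: one "srcId dstId" pair per line
val graph = GraphLoader.edgeListFile(sc, "hdfs://.../followers.txt")
val ranks = graph.pageRank(0.0001).vertices  // (vertexId, rank) pairs
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)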
21. Scalable Machine Learning
What you want to do: build a classifier for X
What you have to do:
• Learn the internals of ML classification algorithms, sampling, feature selection, cross-validation, …
• Potentially learn Spark/Hadoop/…
• Implement 3-4 algorithms
• Implement grid search to find the right algorithm parameters
• Implement validation algorithms
• Experiment with different sample sizes, algorithms, features
• …
…and in the end, ask for help
22. MLBase
Making large-scale machine learning easy
»User specifies the task (e.g. "classify this dataset")
»MLBase picks the best algorithm and best parameters for the task
Develop scalable, high-quality ML algorithms
»Naïve Bayes
»Logistic/Least Squares Regression (L1/L2 regularization)
»Matrix Factorization (ALS, CCD)
»K-Means & DP-Means
First release (summer): collection of scalable algorithms
24. Shark
Hive compatible: HiveQL, UDFs, metadata, etc.
»Works in existing Hive warehouses without changing queries or data!
Fast execution engine
»Uses Spark as the underlying execution engine
»Low-latency, interactive queries
»Scales out and tolerates worker failures
Easy to combine with Spark
»Process data with SQL queries as well as raw Spark code (see the sketch below)
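The Shark paper describes a sql2rdd call that runs HiveQL and hands the result back as an RDD for ordinary Spark code to keep processing. A minimal sketch of that flow; the context construction and the users table are assumptions for illustration, not verified API details:

import shark.SharkContext

// Hypothetical constructor arguments, mirroring SparkContext's (master, job name)
val sc = new SharkContext("local", "SharkSketch")
// Plain HiveQL against an existing (hypothetical) Hive table, result as an RDD
val youngUsers = sc.sql2rdd("SELECT name, age FROM users WHERE age < 20")
println(youngUsers.count())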
28. Spark Streaming
Extends Spark for large-scale stream processing
»Receive data directly from Kafka, Flume, Twitter, etc.
»Fast, scalable, and fault-tolerant
Simple, yet rich batch-like API
»Easy to express complex streaming computations
»Fault-tolerant, stateful stream processing out of the box
29. Motivation
Many important applications must process large streams of live data and provide results in near-real-time
» Social network trends
» Website statistics
» Intrusion detection systems
» Etc.
31. Integration with Batch Processing
Many environments must process the same data both as a live stream and in batch post-processing
Hard for any single existing framework to achieve both:
» Provide low latency for streaming workloads
» Handle large volumes of data for batch workloads
Extremely painful to maintain two stacks
» Different programming models
» Double the implementation effort
» Double the number of bugs
32. Existing Streaming Systems
Storm – limited fault-tolerance guarantees
»Replays records if not processed
»Processes each record at least once
»May double-count events!
»Mutable state can be lost due to failure!
Trident – uses transactions to update state
»Processes each record exactly once
»Per-state transactions to an external database are slow
Neither integrates well with batch processing systems
33. Spark Streaming
Discretized Stream Processing – run a streaming computation as a series of very small, deterministic batch jobs:
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches
[Diagram: a live data stream flows into Spark Streaming, which hands batches of X seconds to Spark, which emits processed results]
34. Spark Streaming
Discretized Stream Processing (continued):
• Batch sizes as low as ½ second, latency ~1 second (setup sketched below)
• Potential for combining batch processing and streaming processing in the same system
[Same diagram as slide 33]
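A minimal sketch of setting up the batch interval just described, using the StreamingContext API as it later stabilized under org.apache.spark; the socket source, host, and port are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
val ssc = new StreamingContext(conf, Seconds(1))     // 1-second batches
val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical TCP source
lines.count().print()                                // record count per batch
ssc.start()
ssc.awaitTermination()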
35. Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
DStream: a sequence of RDDs representing a stream of data
[Diagram: the Twitter streaming API feeds the tweets DStream; each batch interval (t, t+1, t+2, …) yields one immutable, distributed RDD stored in memory]
36. Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
transformation: modify data in one DStream to create another DStream
[Diagram: flatMap runs on each batch of the tweets DStream, creating a new RDD of tags (e.g. [#cat, #dog, …]) in the hashTags DStream for every batch]
37. Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: push data to external storage
[Diagram: every batch of the hashTags DStream is saved to HDFS]
38. Example: Get Twitter Hashtags
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { … })
foreach: do whatever you want with the processed data
[Diagram: foreach runs on every batch of the hashTags DStream – write to a database, update an analytics UI, do whatever you want]
39. Window-based Transformations
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Diagram: a sliding window over the DStream, with a window length of 1 minute and a sliding interval of 5 seconds]
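The same counts can also be written with the dedicated windowed operator, which lets Spark Streaming maintain them incrementally. A minimal sketch assuming the hashTags DStream from above; the checkpoint path is a hypothetical placeholder (incremental window operators require one):

ssc.checkpoint("hdfs://.../checkpoints")  // required for incremental window ops
val tagCounts = hashTags.countByValueAndWindow(Minutes(1), Seconds(5))
tagCounts.print()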
40. Arbitrary Stateful Computations
Specify a function to generate new state based on previous state and new data
» Example: maintain per-user mood as state, and update it with their tweets
updateMood(newTweets, lastMood) => newMood
moods = tweets.updateStateByKey(tweets => updateMood(tweets))
» Exactly-once semantics even under worker failures
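A runnable sketch of that pseudocode: updateStateByKey is the real API, while the (user, tweet text) keying, the getUser/getText helpers, and the smiley-counting "mood" are assumptions for illustration:

import org.apache.spark.streaming.dstream.DStream

// Hypothetical keyed stream built from the tweets DStream above
val userTweets: DStream[(String, String)] =
  tweets.map(status => (getUser(status), getText(status)))

// New mood = old mood + net sentiment of this batch's tweets
// (requires ssc.checkpoint(...), since state is tracked across batches)
val moods: DStream[(String, Double)] = userTweets.updateStateByKey[Double] {
  (newTweets: Seq[String], lastMood: Option[Double]) =>
    val delta = newTweets.count(_.contains(":)")) - newTweets.count(_.contains(":("))
    Some(lastMood.getOrElse(0.0) + delta)
}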
41. Arbitrary Combination of Batch and Streaming Computations
Inter-mix RDD and DStream operations!
» Example: join incoming tweets with a spam file on HDFS to filter out bad tweets
tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
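A fuller, hedged version of that snippet: transform, leftOuterJoin, filter, and map are real RDD/DStream operations, while the spam-file layout and the getUser helper are hypothetical stand-ins:

// Hypothetical one-user-per-line spam list, loaded once as an ordinary RDD
val spamUsers = ssc.sparkContext
  .textFile("hdfs://.../spam_users.txt")
  .map(user => (user, ()))

val cleanTweets = tweets
  .map(status => (getUser(status), status))  // key tweets by user
  .transform { rdd =>
    rdd.leftOuterJoin(spamUsers)             // arbitrary RDD ops per batch
       .filter { case (_, (_, spamHit)) => spamHit.isEmpty }  // drop spam users
       .map { case (_, (status, _)) => status }
  }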
42. DStream Input Sources
Out of the box we provide:
»Kafka
»Twitter
»HDFS
»Flume
»Raw TCP sockets
Very simple API to write a receiver for your own data source! (sketched after this list)
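A minimal custom receiver, sketched with the Receiver class as it later stabilized in Spark Streaming; the 2013 alpha exposed a slightly different NetworkReceiver interface, so treat the names here as assumptions:

import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = new Thread("LineReceiver") {
    override def run(): Unit = {
      val socket = new Socket(host, port)
      // Push each received line into Spark Streaming
      Source.fromInputStream(socket.getInputStream).getLines().foreach(line => store(line))
      socket.close()
    }
  }.start()

  def onStop(): Unit = {}  // the reading thread exits when the socket closes
}

// Usage: val lines = ssc.receiverStream(new LineReceiver("localhost", 9999))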
43. Performance
Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency
» Tested with 100 text streams on 100 EC2 instances with 4 cores each
[Charts: cluster throughput (GB/s) vs. # nodes in cluster for Grep and WordCount, at 1-second and 2-second batch intervals. Caption: High Throughput and Low Latency]
44. Comparison with Storm
Higher throughput than Storm
»Spark Streaming: 670k records/second/node
»Storm: 115k records/second/node
[Charts: throughput per node (MB/s) vs. record size (100-1000 bytes) for Grep and WordCount, Spark vs. Storm]
46. Real Applications: Traffic Sensing
Traffic transit time estimation using online machine learning on GPS observations
• Markov chain Monte Carlo simulations on GPS observations
• Very CPU-intensive; requires dozens of machines for useful computation
• Scales linearly with cluster size
[Chart: GPS observations processed per second vs. # nodes in cluster]
47. Unifying Batch and Stream Models
Spark program on a Twitter log file using RDDs:
val tweets = sc.hadoopFile("hdfs://...")
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFile("hdfs://...")
Spark Streaming program on a Twitter stream using DStreams:
val tweets = ssc.twitterStream(<username>, <password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
The same code base works for both batch processing and stream processing
48. Conclusion
Berkeley Data Analytics Stack
»Next-generation data analytics stack with speed and functionality
More information: www.spark-project.org
Hands-on tutorials: ampcamp.berkeley.edu
»Video tutorials, EC2 exercises
»AMP Camp 2 – August 29-30, 2013