Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
Presto: Interactive SQL Query Engine for Big Data
Hadoop Conference in Japan 2014
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - efficient object serializer
> Fluentd - data collection tool
> ServerEngine - Ruby framework to build multiprocess servers
> LS4 - distributed object storage system
> kumofs - distributed key-value data store
0. Background + Intro
What’s Presto?
A distributed SQL query engine
for interactive data analysis
against GBs to PBs of data.
Presto’s history
> 2012 Fall: Project started at Facebook
> Designed for interactive query
> with speed of commercial data warehouse
> and scalability to the size of Facebook
> 2013 Winter: Open sourced!
> 30+ contributors in 6 months
> including people from outside of Facebook
What are the problems to solve?
> We couldn’t visualize data in HDFS directly
using dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable
> We needed to store daily-batch results to an
interactive DB for quick response
(PostgreSQL, Redshift, etc.)
> Interactive DB costs more and is far less scalable
> Some data are not stored in HDFS
> We need to copy the data into HDFS to analyze
[Diagram: a batch analysis platform (HDFS, Hive, daily/hourly batch) loads results into a separate visualization platform (PostgreSQL, etc.), which serves interactive queries from dashboards and commercial BI tools]
Problems with this setup:
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
[Diagram: Presto is added between the dashboard and HDFS/Hive and serves the interactive queries; Hive still runs the daily/hourly batch]
[Diagram: Presto also queries Cassandra, MySQL, and commercial DBs: SQL on any data sets]
[Diagram: commercial BI tools (IBM Cognos, Tableau, ...) connect to the same setup, forming a single data analysis platform]
dashboard on chart.io: https://chartio.com/
What can Presto do?
> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL
> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity
> Query across multiple data sources such as
Hive, HBase, Cassandra, or even commercial DBs
> Plugin mechanism
> Integrate batch analysis + visualization
into a single data analysis platform
Presto’s deployment
> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> who run 30,000+ queries every day
> processing 1PB/day
> Netflix, Dropbox, Treasure Data, Airbnb, Qubole
> Presto as a Service
Today’s talk
1. Distributed architecture
2. Data visualization - Demo
3. Query Execution - Presto vs. MapReduce
4. Monitoring & Configuration
5. Roadmap - the future
1. Distributed architecture
[Diagram: Client, Coordinator (with Connector Plugin), Workers, Storage / Metadata, Discovery Service]
1. Discovery Service: find servers in a cluster
2. Client sends a query using HTTP
3. Coordinator builds a query plan
   Connector plugin provides metadata (table schema, etc.)
4. Coordinator sends tasks to workers
5. Workers read data through connector plugin
6. Workers run tasks in memory
7. Client gets the result from a worker
What are Connectors?
> Connectors are plugins to Presto
> written in Java
> Access to storage and metadata
> provide table schema to coordinators
> provide table rows to workers
> Implementations:
> Hive connector
> Cassandra connector
> MySQL through JDBC connector (prerelease)
> Or your own connector
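The two roles of a connector listed above (table schema for the coordinator, table rows for workers) can be sketched as a toy class. This is only an illustration in Python: real Presto connectors are written in Java, and the class and method names here (InMemoryConnector, get_table_schema, get_splits, read_split) are hypothetical, not Presto's plugin API.

```python
from typing import Dict, Iterator, List, Tuple

Schema = List[Tuple[str, str]]  # list of (column name, column type)


class InMemoryConnector:
    """Toy connector: serves schema to the coordinator and rows to workers."""

    def __init__(self, tables: Dict[str, Tuple[Schema, List[tuple]]]):
        # table name -> (schema, rows)
        self._tables = tables

    def get_table_schema(self, table: str) -> Schema:
        """Metadata side: what the coordinator needs to plan a query."""
        return self._tables[table][0]

    def get_splits(self, table: str, split_size: int) -> List[range]:
        """Divide the table into row ranges that can be read independently."""
        n = len(self._tables[table][1])
        return [range(i, min(i + split_size, n)) for i in range(0, n, split_size)]

    def read_split(self, table: str, split: range) -> Iterator[tuple]:
        """Data side: what a worker calls to read the rows of one split."""
        rows = self._tables[table][1]
        for i in split:
            yield rows[i]


connector = InMemoryConnector({
    "impressions": (
        [("name", "varchar"), ("time", "bigint")],
        [("a", 1), ("b", 2), ("a", 3)],
    )
})
schema = connector.get_table_schema("impressions")
splits = connector.get_splits("impressions", split_size=2)
rows = [row for s in splits for row in connector.read_split("impressions", s)]
```

Keeping metadata access and row access as separate methods mirrors the split of responsibilities between coordinator and workers described above.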
[Diagram: Hive connector. Client, Coordinator, and Workers reach HDFS and the Hive Metastore through the Hive Connector; Discovery Service finds servers in a cluster]
[Diagram: Cassandra connector. Same layout, with the Cassandra Connector in front of Cassandra]
[Diagram: Multiple connectors in a query. Hive Connector (HDFS / Metastore), Cassandra Connector (Cassandra), and other connectors (other data sources...) used together]
1. Distributed architecture
> 3 types of servers:
> Coordinator, worker, discovery service
> Get data/metadata through connector plugins.
> Presto is NOT a database
> Presto provides SQL on existing data stores
> Client protocol is HTTP + JSON
> Language bindings:
Ruby, Python, PHP, Java (JDBC), R, Node.JS...
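The HTTP + JSON client protocol can be sketched as a polling loop. This is a minimal sketch, assuming each JSON response may carry a batch of rows under "data" and a "nextUri" to poll for the next batch; the fetch callable and the fake pages below are hypothetical stand-ins for real HTTP requests to a coordinator.

```python
from typing import Callable, Dict, Iterator, List, Optional


def iter_rows(first_page: Dict, fetch: Callable[[str], Dict]) -> Iterator[List]:
    """Yield result rows, following `nextUri` links until none is left."""
    page: Optional[Dict] = first_page
    while page is not None:
        # A page may carry zero or more rows while the query is still running.
        for row in page.get("data", []):
            yield row
        next_uri = page.get("nextUri")
        page = fetch(next_uri) if next_uri else None


# Hypothetical in-memory "server": maps a URI to the JSON page it would return.
FAKE_PAGES = {
    "/v1/statement/q1/1": {"data": [["a", 3]], "nextUri": "/v1/statement/q1/2"},
    "/v1/statement/q1/2": {"data": [["b", 5]]},
}

first = {"nextUri": "/v1/statement/q1/1"}  # stands in for the first response
rows = list(iter_rows(first, FAKE_PAGES.__getitem__))
print(rows)
```

Passing the transport in as a callable keeps the loop independent of any particular HTTP library, which is roughly why thin language bindings are easy to write for this kind of protocol.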
[Diagram: Coordinator HA. A second Coordinator is added alongside the Client, Connector Plugin, Workers, Storage / Metadata, and Discovery Service]
2. Data visualization
The problems with using BI tools
> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlikView, Chart.IO, ...
> JasperSoft, Pentaho, MotionBoard, ...
> ODBC/JDBC is VERY COMPLICATED
> A mature implementation takes a LONG time
A solution: PostgreSQL protocol
> Creating a PostgreSQL protocol gateway
> Using PostgreSQL’s stable ODBC / JDBC driver
https://github.com/treasure-data/prestogres
How does Prestogres work?
[Diagram: client → pgpool-II + patch → PostgreSQL → Presto Coordinator]
1. Client sends: SELECT COUNT(1) FROM tbl1
2. pgpool-II + patch rewrites it to:
   select run_presto_as_temp_table(
   'presto_result', 'SELECT COUNT(1) FROM tbl1');
3. The “run_presto_as_temp_table” function runs the query on Presto
4. The result is read back with: SELECT * FROM presto_result;
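The rewrite in step 2 can be sketched as a small string transformation. This is an illustration only, not the actual pgpool-II patch: the function name rewrite_for_presto is hypothetical, and the quoting rule (doubling single quotes) is an assumption about how the original query would be embedded in a SQL string literal.

```python
from typing import List


def rewrite_for_presto(sql: str, temp_table: str = "presto_result") -> List[str]:
    """Return the two statements sent to PostgreSQL in place of `sql`."""
    # Escape single quotes so the query can be embedded in a SQL string literal.
    escaped = sql.replace("'", "''")
    return [
        f"select run_presto_as_temp_table('{temp_table}', '{escaped}');",
        f"SELECT * FROM {temp_table};",
    ]


statements = rewrite_for_presto("SELECT COUNT(1) FROM tbl1")
for statement in statements:
    print(statement)
```

Because the client only ever sees ordinary PostgreSQL statements and results, the stock PostgreSQL ODBC / JDBC drivers can be used unchanged.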
Demo
2. Data visualization with Presto
> Data visualization tools need ODBC/JDBC driver
> but implementation takes a LONG time
> A solution is to use PostgreSQL protocol
> and use PostgreSQL’s ODBC/JDBC driver
> Prestogres is already confirmed to work with
some commercial BI tools
3. Query Execution
Presto’s execution model
> Presto is NOT MapReduce
> Presto’s query plan is based on a DAG
> more like Apache Tez or traditional MPP
databases
How does a query run?
> Coordinator
> SQL Parser
> Query Planner
> Execution planner
> Workers
> Task execution scheduler
[Diagram: SQL → SQL Parser → AST → Logical Planner (with Optimizer) → Logical Query Plan → Distributed Planner → Distributed Query Plan → Execution Planner → Execution Plan]
[Inputs: the Connector supplies Metadata (✓ table schema); the Discovery Service supplies the NodeManager (✓ node list)]
[In today’s talk, the planning steps between the SQL Parser and the Execution Planner are grouped as the “Query Planner”]
Query Planner
SQL:
SELECT
name,
count(*) AS c
FROM impressions
GROUP BY name

+ Table schema:
impressions (
name varchar
time bigint
)

Logical query plan:
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c)

Distributed query plan:
Table scan → Partial aggregation → Sink
Exchange → Final aggregation → Sink
Exchange → Output
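The partial aggregation / exchange / final aggregation pipeline of the distributed plan can be sketched for the GROUP BY name query above. This is a toy model, not Presto's implementation: the function names are hypothetical, each list stands in for the rows scanned by one worker, and hash partitioning by name is an assumption about how the exchange routes rows.

```python
from collections import Counter
from typing import Dict, List


def partial_aggregation(scanned_names: List[str]) -> Counter:
    """Runs next to the table scan: count names within one worker's rows."""
    return Counter(scanned_names)


def exchange(partials: List[Counter], num_workers: int) -> List[List[Counter]]:
    """Route each name's partial count to one worker, chosen by hash."""
    buckets: List[List[Counter]] = [[] for _ in range(num_workers)]
    for partial in partials:
        for name, count in partial.items():
            buckets[hash(name) % num_workers].append(Counter({name: count}))
    return buckets


def final_aggregation(received: List[Counter]) -> Dict[str, int]:
    """Merge the partial counts that arrived at one worker."""
    total: Counter = Counter()
    for partial in received:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


scans = [["a", "b", "a"], ["b", "b", "c"]]  # rows read by two workers
partials = [partial_aggregation(s) for s in scans]
result: Dict[str, int] = {}
for received in exchange(partials, num_workers=2):
    # Each name lands on exactly one worker, so the outputs never overlap.
    result.update(final_aggregation(received))
```

Partial aggregation shrinks the data before it crosses the network; the exchange then guarantees that all partial counts for one name meet on the same worker.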
Query Planner - Stages
[Diagram: the distributed query plan is cut into three stages (Stage-0, Stage-1, Stage-2): Table scan → Partial aggregation → Sink; Exchange → Final aggregation → Sink; Exchange → Output. Stages are connected by inter-worker data transfer, and aggregation is pipelined]
Execution Planner
+ Node list
✓ 2 workers
[Diagram: the table scan / partial aggregation stage and the final aggregation stage are each instantiated on Worker 1 and Worker 2; a single Output stage collects the result]
Execution Planner - Tasks
[Diagram: each stage instance on Worker 1 and Worker 2 is a Task]
1 task / worker / stage
✓ All tasks in parallel
Execution Planner - Split
[Diagram: the tasks on Worker 1 and Worker 2 are divided into Splits]
many splits / task = many threads / worker (table scan)
1 split / task = 1 thread / worker
1 split / worker = 1 thread / worker
All stages are pipelined
✓ No wait time
✓ No fault-tolerance
MapReduce vs. Presto
MapReduce: map → disk → reduce → disk → map → disk → reduce
Write data to disk
Wait between stages
Presto: task → task → task, with memory-to-memory data transfer
✓ No disk IO
✓ Data chunk must fit in memory
3. Query Execution
> SQL is converted into stages, tasks and splits
> All tasks run in parallel
> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (query fails)
> Memory-to-memory data transfer
> No disk IO
> If aggregated data doesn’t fit in memory,
query fails
• Note: query dies but worker doesn’t die.
Memory consumption of all queries is fully managed
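The failure behaviour summarized above (rows stream between stages without touching disk, and a query that outgrows its memory budget fails while the worker keeps running) can be sketched with generators. This is a toy model under stated assumptions: the memory budget is approximated by a maximum number of groups, and the names QueryFailed, table_scan, aggregate, and run_query are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


class QueryFailed(Exception):
    """Raised when a query exceeds its memory budget."""


def table_scan(rows: Iterable[str]) -> Iterator[str]:
    """Stream rows one at a time; nothing is written to disk between stages."""
    for row in rows:
        yield row


def aggregate(rows: Iterator[str], max_groups: int) -> Dict[str, int]:
    """Count rows per name, failing once the hash table would grow too large."""
    counts: Dict[str, int] = {}
    for name in rows:
        if name not in counts and len(counts) >= max_groups:
            raise QueryFailed("Task exceeded max memory size")
        counts[name] = counts.get(name, 0) + 1
    return counts


def run_query(rows: Iterable[str], max_groups: int) -> Tuple[str, object]:
    """The worker survives a failed query and can accept the next one."""
    try:
        return ("FINISHED", aggregate(table_scan(rows), max_groups))
    except QueryFailed as error:
        return ("FAILED", str(error))


ok = run_query(["a", "b", "a"], max_groups=2)
failed = run_query(["a", "b", "c"], max_groups=2)
```

The aggregation consumes rows as the scan produces them, which is the sense in which stages are pipelined; the cost is that there is no intermediate result on disk to recover from.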
4. Monitoring & Configuration
Monitoring
> Web UI
> basic query status check
> JMX HTTP API
> GET /v1/jmx/mbean[/{objectName}]
• com.facebook.presto.execution:name=TaskManager
• com.facebook.presto.execution:name=QueryManager
• com.facebook.presto.execution:name=NodeScheduler
> Event notification (remote logging)
> POST http://remote.server/v2/event
• query start, query complete, split complete
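Building a URL for the JMX HTTP API listed above can be sketched as follows. The path and MBean names come from the slide; the host and port in the usage are hypothetical, and leaving ":" and "=" unescaped in the object name is an assumption (both characters are legal in a URL path segment).

```python
from urllib.parse import quote


def jmx_mbean_url(base_url: str, object_name: str = "") -> str:
    """Build the URL for GET /v1/jmx/mbean[/{objectName}]."""
    url = base_url.rstrip("/") + "/v1/jmx/mbean"
    if object_name:
        # Keep ":", "=", and "," readable; escape anything else that is unsafe.
        url += "/" + quote(object_name, safe=":=,")
    return url


# Hypothetical coordinator address, for illustration only.
base = "http://coordinator.example.com:8080"
all_mbeans = jmx_mbean_url(base)
task_manager = jmx_mbean_url(base, "com.facebook.presto.execution:name=TaskManager")
print(task_manager)
```

The resulting URL can be fetched with any HTTP client and the JSON response fed into an existing monitoring system.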
Configuration
> Execution planning (for coordinator)
> query.initial-hash-partitions
• max number of hash buckets (=tasks) of a GROUP BY
(default: 8)
> node-scheduler.min-candidates
• max number of workers to run a stage in parallel
(default: 10)
> node-scheduler.include-coordinator
• whether to run tasks only on workers or also on the coordinator
> query.schedule-split-batch-size
• number of splits of a stage to start at once
Configuration
> Task execution (for workers)
> task.cpu-timer-enabled
• enable detailed statistics (causes some overhead)
(default: true)
> task.max-memory
• memory limit of a task especially for hash tables used by
GROUP BY and JOIN operations (default: 256MB)
• enlarge if you get a “Task exceeded max memory size” error
> task.shard.max-threads
• max number of threads of a worker to run active splits
(default: number of CPU cores * 4)
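The settings above can be collected into key=value properties text. This is a minimal sketch, assuming the properties are written one per line as key=value; the values are the defaults quoted on the slides, the 8-core worker is a made-up example, and the helper names render_properties and default_max_threads are hypothetical.

```python
from typing import Dict


def default_max_threads(cpu_cores: int) -> int:
    """Default for task.shard.max-threads as stated above: CPU cores * 4."""
    return cpu_cores * 4


def render_properties(settings: Dict[str, object]) -> str:
    """Render settings as key=value lines, writing booleans in lowercase."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Execution planning (for coordinator), using the defaults from the slide.
coordinator = {
    "query.initial-hash-partitions": 8,
    "node-scheduler.min-candidates": 10,
}

# Task execution (for workers); the 8-core machine is a made-up example.
worker = {
    "task.cpu-timer-enabled": True,
    "task.max-memory": "256MB",
    "task.shard.max-threads": default_max_threads(8),
}

print(render_properties(coordinator))
print(render_properties(worker))
```

When raising task.max-memory to avoid the “Task exceeded max memory size” error, the change only has to be made in the worker settings.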
5. Roadmap
A report of Presto Meetup 2014
http://www.slideshare.net/dain1/presto-meetup-20140514-34731104
"Presto, Past, Present, and Future" by Dain Sundstrom at Facebook
Presto’s future
> Huge JOIN and GROUP BY
> Spill to disk
> Task recovery
> CREATE VIEW (※implemented)
> Native store (※implemented)
> Fast data store in Presto workers
> to cache hot data
> Authentication and permissions
Presto’s future
> DDL/DML statements
> CREATE TABLE with partitioning
> DELETE and INSERT
> Plugin repository
> CLI plugin manager
> JOIN and aggregation pushdown
> Custom optimizers
Links
> Web site & document
> http://prestodb.io
> Mailing list
> https://groups.google.com/group/presto-users
> Github
> https://github.com/facebook/presto
> Guidelines for contribution
> https://github.com/facebook/presto/blob/master/CONTRIBUTING.md
Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!

More Related Content

What's hot (20)

Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
ScyllaDB
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Filip Ilievski
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
kiran palaka
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
API : l'architecture REST
API : l'architecture RESTAPI : l'architecture REST
API : l'architecture REST
Fadel Chafai
 
java Servlet technology
java Servlet technologyjava Servlet technology
java Servlet technology
Tanmoy Barman
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
Chandler Huang
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
Bruno Faria
 
Mongoose getting started-Mongo Db with Node js
Mongoose getting started-Mongo Db with Node jsMongoose getting started-Mongo Db with Node js
Mongoose getting started-Mongo Db with Node js
Pallavi Srivastava
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
 
Presto
PrestoPresto
Presto
Knoldus Inc.
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
DataWorks Summit/Hadoop Summit
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
Databricks
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
pko89403
 
Big Data - Conceptos, herramientas y patrones
Big Data - Conceptos, herramientas y patronesBig Data - Conceptos, herramientas y patrones
Big Data - Conceptos, herramientas y patrones
Juan José Domenech
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
Dushhyant Kumar
 
Build Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDBBuild Low-Latency Applications in Rust on ScyllaDB
Build Low-Latency Applications in Rust on ScyllaDB
ScyllaDB
 
Presto: Distributed sql query engine
Presto: Distributed sql query engine Presto: Distributed sql query engine
Presto: Distributed sql query engine
kiran palaka
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
API : l'architecture REST
API : l'architecture RESTAPI : l'architecture REST
API : l'architecture REST
Fadel Chafai
 
java Servlet technology
java Servlet technologyjava Servlet technology
java Servlet technology
Tanmoy Barman
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
Introducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using itIntroducing Apache Airflow and how we are using it
Introducing Apache Airflow and how we are using it
Bruno Faria
 
Mongoose getting started-Mongo Db with Node js
Mongoose getting started-Mongo Db with Node jsMongoose getting started-Mongo Db with Node js
Mongoose getting started-Mongo Db with Node js
Pallavi Srivastava
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
DevOps for Databricks
DevOps for DatabricksDevOps for Databricks
DevOps for Databricks
Databricks
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
pko89403
 
Big Data - Conceptos, herramientas y patrones
Big Data - Conceptos, herramientas y patronesBig Data - Conceptos, herramientas y patrones
Big Data - Conceptos, herramientas y patrones
Juan José Domenech
 

Similar to Presto - Hadoop Conference Japan 2014 (20)

SQL on Hadoop in Taiwan
SQL on Hadoop in TaiwanSQL on Hadoop in Taiwan
SQL on Hadoop in Taiwan
Treasure Data, Inc.
 
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014
N Masahiro
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
Sadayuki Furuhashi
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
N Masahiro
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
Sadayuki Furuhashi
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
Sadayuki Furuhashi
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
viirya
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Alluxio, Inc.
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detik
k4ndar
 
Boston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseBoston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the Enterprise
Matt Fuller
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
N Masahiro
 
Prashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEWPrashanth Kumar_Hadoop_NEW
Prashanth Kumar_Hadoop_NEW
Prashanth Shankar kumar
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
MongoDB
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
J S Jodha
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
Blake Irvine
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
mashoodsyed66
 
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014
N Masahiro
 
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
N Masahiro
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
Sadayuki Furuhashi
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
viirya
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Alluxio, Inc.
 
Big data-at-detik
Big data-at-detikBig data-at-detik
Big data-at-detik
k4ndar
 
Boston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the EnterpriseBoston Hadoop Meetup: Presto for the Enterprise
Boston Hadoop Meetup: Presto for the Enterprise
Matt Fuller
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
N Masahiro
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
MongoDB
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
Jan Pieter Posthuma
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
J S Jodha
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
Blake Irvine
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop EcosystemUnveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
mashoodsyed66
 
Ad

More from Sadayuki Furuhashi (20)

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
Sadayuki Furuhashi
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Sadayuki Furuhashi
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
Sadayuki Furuhashi
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
Sadayuki Furuhashi
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
Sadayuki Furuhashi
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
Sadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
Sadayuki Furuhashi
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
Sadayuki Furuhashi
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
Sadayuki Furuhashi
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダ
Sadayuki Furuhashi
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
Sadayuki Furuhashi
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
Sadayuki Furuhashi
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loader
Sadayuki Furuhashi
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect More
Sadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
Sadayuki Furuhashi
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
Sadayuki Furuhashi
 
Fluentd meetup at Slideshare
Fluentd meetup at SlideshareFluentd meetup at Slideshare
Fluentd meetup at Slideshare
Sadayuki Furuhashi
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
Sadayuki Furuhashi
 
Fluentd meetup
Fluentd meetupFluentd meetup
Fluentd meetup
Sadayuki Furuhashi
 
upload test 1
upload test 1upload test 1
upload test 1
Sadayuki Furuhashi
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Sadayuki Furuhashi
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
Sadayuki Furuhashi
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
Sadayuki Furuhashi
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
Sadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
Sadayuki Furuhashi
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
Sadayuki Furuhashi
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
Sadayuki Furuhashi
 
Embulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダEmbulk - 進化するバルクデータローダ
Embulk - 進化するバルクデータローダ
Sadayuki Furuhashi
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
Sadayuki Furuhashi
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loader
Sadayuki Furuhashi
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect More
Sadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
Sadayuki Furuhashi
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
Sadayuki Furuhashi
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
Sadayuki Furuhashi
 

Presto - Hadoop Conference Japan 2014

  • 1. Sadayuki Furuhashi, Founder & Software Architect, Treasure Data, Inc. Presto: Interactive SQL Query Engine for Big Data. Hadoop Conference in Japan 2014
  • 2. A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure Data, Inc. > Founder & Software Architect > Open-source hacker > MessagePack - efficient object serializer > Fluentd - data collection tool > ServerEngine - Ruby framework to build multiprocess servers > LS4 - distributed object storage system > kumofs - distributed key-value data store
  • 4. What’s Presto? A distributed SQL query engine for interactive data analysis against GBs to PBs of data.
  • 5. Presto’s history > 2012 Fall: Project started at Facebook > Designed for interactive query > with speed of commercial data warehouse > and scalability to the size of Facebook > 2013 Winter: Open sourced! > 30+ contributors in 6 months > including people from outside of Facebook
  • 6. What are the problems to solve? > We couldn’t visualize data in HDFS directly using dashboards or BI tools > because Hive is too slow (not interactive) > or ODBC connectivity is unavailable/unstable > We needed to store daily-batch results to an interactive DB for quick response (PostgreSQL, Redshift, etc.) > Interactive DBs cost more and are far less scalable > Some data are not stored in HDFS > We need to copy the data into HDFS to analyze
  • 10. [Diagram] Batch analysis platform: HDFS, Hive (daily/hourly batch) → Visualization platform: PostgreSQL, etc. (interactive query) → Commercial BI tools, Dashboard
  • 11. [Same diagram, annotated with drawbacks] ✓ Less scalable ✓ Extra cost ✓ More work to manage 2 platforms ✓ Can’t query against “live” data directly
  • 12. [Diagram] Before: HDFS, Hive → daily/hourly batch → PostgreSQL, etc. → interactive query → Dashboard. With Presto: HDFS, Hive (daily/hourly batch) → Presto → interactive query → Dashboard
  • 14. [Diagram] Data analysis platform: Presto runs SQL on any data sets (HDFS / Hive with daily/hourly batch, Cassandra, MySQL, commercial DBs) and serves interactive queries to the Dashboard and to commercial BI tools ✓ IBM Cognos ✓ Tableau ✓ ...
  • 15. Dashboard on Chartio: https://chartio.com/
  • 16. What can Presto do? > Query interactively (in milliseconds to minutes) > MapReduce and Hive are still necessary for ETL > Query using commercial BI tools or dashboards > Reliable ODBC/JDBC connectivity > Query across multiple data sources such as Hive, HBase, Cassandra, or even commercial DBs > Plugin mechanism > Integrate batch analysis + visualization into a single data analysis platform
  • 17. Presto’s deployment > Facebook > Multiple geographical regions > scaled to 1,000 nodes > actively used by 1,000+ employees > who run 30,000+ queries every day > processing 1PB/day > Netflix, Dropbox, Treasure Data, Airbnb, Qubole > Presto as a Service
  • 18. Today’s talk 1. Distributed architecture 2. Data visualization - Demo 3. Query Execution - Presto vs. MapReduce 4. Monitoring & Configuration 5. Roadmap - the future
  • 21. [Architecture diagram: Client, Coordinator, Connector Plugin, Workers, Storage / Metadata, Discovery Service] Step 1: Discovery Service finds servers in a cluster
  • 22. Step 2: Client sends a query using HTTP
  • 23. Step 3: Coordinator builds a query plan. Connector plugin provides metadata (table schema, etc.)
  • 24. Step 4: Coordinator sends tasks to workers
  • 25. Step 5: Workers read data through connector plugin
  • 26. Step 6: Workers run tasks in memory
  • 27. Step 7: Client gets the result from a worker
  • 29. What are connectors? > Connectors are plugins to Presto > written in Java > Access to storage and metadata > provide table schema to coordinators > provide table rows to workers > Implementations: > Hive connector > Cassandra connector > MySQL through JDBC connector (prerelease) > Or your own connector
  • 32. Client Coordinator other connectors ... Worker Worker Worker Cassandra Discovery Service find servers in a cluster Hive Connector HDFS / Metastore Multiple connectors in a query Cassandra Connector Other data sources...
  • 33. 1. Distributed architecture > 3 types of servers: > Coordinator, worker, discovery service > Get data/metadata through connector plugins. > Presto is NOT a database > Presto provides SQL to existing data stores > Client protocol is HTTP + JSON > Language bindings: Ruby, Python, PHP, Java (JDBC), R, Node.js...
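The "HTTP + JSON" client protocol can be illustrated with a small Python sketch. The slides do not spell out the wire format, so the `/v1/statement` path, the `X-Presto-*` headers and the `nextUri` / `data` / `error` response fields below are assumptions based on how the Presto REST protocol is commonly described; the function names, default user, catalog and schema are hypothetical. The transport is injectable so the paging loop can be exercised without a running cluster.

```python
import json
from urllib.request import Request, urlopen


def urllib_transport(method, url, headers, body):
    """Send one HTTP request and decode the JSON response (stdlib only)."""
    req = Request(url, data=body, headers=headers, method=method)
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(coordinator, sql, user="demo", catalog="hive", schema="default",
              transport=urllib_transport):
    """Yield result rows of `sql`, following result pages until none remain.

    Assumed protocol: POST the SQL text to /v1/statement, then GET each
    `nextUri` returned by the server until a page has no `nextUri`.
    """
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    page = transport("POST", coordinator + "/v1/statement", headers,
                     sql.encode("utf-8"))
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        for row in page.get("data", []):
            yield row
        next_uri = page.get("nextUri")
        if next_uri is None:
            return
        page = transport("GET", next_uri, headers, None)
```

Against a real cluster one would call something like `list(run_query("http://coordinator.example:8080", "SELECT 1"))`, where the host and port are placeholders.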
  • 34. Client Coordinator Connector Plugin Worker Worker Worker Storage / Metadata Discovery Service Coordinator Coordinator HA
  • 36. The problems with using BI tools > BI tools need ODBC or JDBC connectivity > Tableau, IBM Cognos, QlikView, Chartio, ... > JasperSoft, Pentaho, MotionBoard, ... > ODBC/JDBC is VERY COMPLICATED > A mature implementation takes a LONG time
  • 37. A solution: PostgreSQL protocol > Creating a PostgreSQL protocol gateway > Using PostgreSQL’s stable ODBC / JDBC driver https://github.com/treasure-data/prestogres
  • 38. How does Prestogres work? [Diagram: client → pgpool-II + patch → PostgreSQL → Presto Coordinator] 1. Client sends: SELECT COUNT(1) FROM tbl1 2. pgpool-II + patch rewrites it to: select run_presto_as_temp_table('presto_result', 'SELECT COUNT(1) FROM tbl1'); 3. The “run_presto_as_temp_table” function runs the query on the Presto Coordinator 4. SELECT * FROM presto_result;
  • 39. Demo
  • 40. 2. Data visualization with Presto > Data visualization tools need an ODBC/JDBC driver > but implementation takes a LONG time > A solution is to use the PostgreSQL protocol > and use PostgreSQL’s ODBC/JDBC driver > Prestogres is already confirmed to work with some commercial BI tools
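The query rewriting that the Prestogres slide describes can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual pgpool-II patch: the function name `rewrite_for_presto` is hypothetical, and the single-quote doubling is an assumption about how the inner query would need to be escaped as a SQL string literal.

```python
def rewrite_for_presto(client_sql, temp_table="presto_result"):
    """Return the two statements sent to PostgreSQL in place of `client_sql`.

    The first statement asks PostgreSQL to run the original query on Presto
    and store the rows in a temporary table; the second reads that table
    back over the ordinary PostgreSQL protocol.
    """
    # Double single quotes so the query survives as a SQL string literal.
    escaped = client_sql.replace("'", "''")
    return [
        "select run_presto_as_temp_table('%s', '%s');" % (temp_table, escaped),
        "SELECT * FROM %s;" % temp_table,
    ]
```

Because the client only ever sees PostgreSQL statements and results, the stable PostgreSQL ODBC / JDBC drivers can be reused unchanged.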
  • 42. Presto’s execution model > Presto is NOT MapReduce > Presto’s query plan is based on a DAG > more like Apache Tez or traditional MPP databases
  • 43. How does a query run? > Coordinator > SQL Parser > Query Planner > Execution planner > Workers > Task execution scheduler
  • 44. SQL SQL Parser AST Logical Planner Distributed Planner Logical Query Plan Execution Planner Discovery Server Connector Distributed Query Plan Execution Plan Optimizer NodeManager ✓ node list ✓ table schema Metadata
  • 45. SQL SQL Parser SQL Distributed Planner Logical Query Plan Execution Planner Discovery Service Connector Query Plan Execution Plan Optimizer NodeManager ✓ node list ✓ table schema Metadata (today’s talk) Query Planner
  • 46. Query Planner > SQL: SELECT name, count(*) AS c FROM impressions GROUP BY name > Table schema: impressions (name varchar, time bigint) > Logical query plan: Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c) > Distributed query plan: Table scan → Partial aggregation → Sink → Exchange → Final aggregation → Sink → Exchange → Output
  • 47. Query Planner - Stages > Stage-2: Table scan → Partial aggregation → Sink > inter-worker data transfer > Stage-1: Exchange → Final aggregation → Sink (pipelined aggregation) > inter-worker data transfer > Stage-0: Exchange → Output
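The split into partial aggregation, exchange and final aggregation for the GROUP BY query above can be mimicked with plain Python. This is a toy model under stated assumptions: rows are dicts with a `name` key, each inner list stands for the rows one table-scan task reads, and all function names are hypothetical. Presto's real operators work on in-memory pages across workers, not Python dicts.

```python
from collections import Counter, defaultdict


def partial_aggregate(split_rows):
    """Leaf stage: scan one task's rows and count them per name locally."""
    return Counter(row["name"] for row in split_rows)


def exchange(partials, num_partitions):
    """Route every (name, partial count) pair to a partition by hash(name).

    All partial counts for the same name end up in the same partition, so
    each final-aggregation task sees every contribution for its names.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for partial in partials:
        for name, count in partial.items():
            partitions[hash(name) % num_partitions][name].append(count)
    return partitions


def final_aggregate(partition):
    """Middle stage: sum the partial counts received for each name."""
    return {name: sum(counts) for name, counts in partition.items()}


def run_group_by(splits, num_partitions=2):
    """Output stage: gather the final counts from every partition."""
    partials = [partial_aggregate(rows) for rows in splits]
    result = {}
    for partition in exchange(partials, num_partitions):
        result.update(final_aggregate(partition))
    return result
```

The partial step shrinks the data before it crosses the network, which is why the distributed plan aggregates twice instead of shipping raw rows to a single node.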
  • 48. Execution Planner + Node list ✓ 2 workers > [Diagram] The stage fragments (Table scan → Partial aggregation → Sink, and Exchange → Final aggregation → Sink) are instantiated on both Worker 1 and Worker 2; Exchange → Output runs once
  • 49. Execution Planner - Tasks > [Same diagram] Task: 1 task / worker / stage ✓ All tasks in parallel
  • 50. Execution Planner - Split > [Same diagram] many splits / task = many threads / worker (table scan) > 1 split / task = 1 thread / worker > 1 split / worker = 1 thread / worker
  • 51. MapReduce vs. Presto > MapReduce: map → disk → reduce → disk → map → disk → reduce ✓ Write data to disk ✓ Wait between stages > Presto: task → task → task, all stages are pipelined ✓ No wait time ✓ No fault-tolerance > memory-to-memory data transfer ✓ No disk IO ✓ Data chunk must fit in memory
  • 52. 3. Query Execution > SQL is converted into stages, tasks and splits > All tasks run in parallel > No wait time between stages (pipelined) > If one task fails, all tasks fail at once (query fails) > Memory-to-memory data transfer > No disk IO > If aggregated data doesn’t fit in memory, query fails • Note: query dies but worker doesn’t die. Memory consumption of all queries is fully managed
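The pipelined, memory-only execution summarized above can be imitated with Python generators. This is a loose analogy rather than Presto's scheduler: each generator stands for a stage, chunks flow to the next stage as soon as they are produced, and `max_groups` is a hypothetical stand-in for the task memory limit.

```python
def table_scan(rows, chunk_size=2):
    """Produce rows chunk by chunk instead of materializing a full result."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]


def filter_stage(chunks, predicate):
    """Consume chunks as they arrive and pass on the matching rows."""
    for chunk in chunks:
        yield [row for row in chunk if predicate(row)]


def count_stage(chunks, max_groups):
    """Aggregate in memory; exceeding the limit fails the whole query."""
    counts = {}
    for chunk in chunks:
        for name in chunk:
            counts[name] = counts.get(name, 0) + 1
            if len(counts) > max_groups:
                # No spill to disk and no retry: the error propagates
                # through every stage of the pipeline.
                raise MemoryError("Task exceeded max memory size")
    return counts
```

For example, `count_stage(filter_stage(table_scan(rows), pred), max_groups=1000)` never writes intermediate data anywhere. The price is that a failure in any stage aborts the whole pipeline, which mirrors the "query fails" behaviour on the slide.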
  • 53. 4. Monitoring & Configuration
  • 54. Monitoring > Web UI > basic query status check > JMX HTTP API > GET /v1/jmx/mbean[/{objectName}] • com.facebook.presto.execution:name=TaskManager • com.facebook.presto.execution:name=QueryManager • com.facebook.presto.execution:name=NodeScheduler > Event notification (remote logging) > POST http://remote.server/v2/event • query start, query complete, split complete
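A minimal Python sketch of polling the JMX HTTP API listed above. The `/v1/jmx/mbean[/{objectName}]` path comes from the slide. The base URL is a placeholder, and percent-encoding the object name as well as expecting a JSON response are assumptions that should be checked against a running server.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def mbean_url(base_url, object_name=None):
    """Build the URL for all MBeans, or for one MBean if a name is given."""
    url = base_url.rstrip("/") + "/v1/jmx/mbean"
    if object_name is not None:
        # Encode ':' and '=' so the object name stays one path segment.
        url += "/" + quote(object_name, safe="")
    return url


def fetch_mbean(base_url, object_name=None):
    """GET the MBean document and decode it, assuming the body is JSON."""
    with urlopen(mbean_url(base_url, object_name)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A monitoring script could call `fetch_mbean(base, "com.facebook.presto.execution:name=QueryManager")` on a timer and forward selected attributes to a metrics system.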
  • 55. Configuration > Execution planning (for coordinator) > query.initial-hash-partitions • max number of hash buckets (=tasks) of a GROUP BY (default: 8) > node-scheduler.min-candidates • max number of workers to run a stage in parallel (default: 10) > node-scheduler.include-coordinator • whether to run tasks only on workers or also on the coordinator > query.schedule-split-batch-size • number of splits of a stage to start at once
  • 56. Configuration > Task execution (for workers) > task.cpu-timer-enabled • enable detailed statistics (causes some overhead) (default: true) > task.max-memory • memory limit of a task, especially for hash tables used by GROUP BY and JOIN operations (default: 256MB) • enlarge if you get a “Task exceeded max memory size” error > task.shard.max-threads • max number of threads of a worker to run active splits (default: number of CPU cores * 4)
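As a rough illustration, the properties above might appear in a node's configuration file as follows. The values are simply the defaults quoted on the slides. The file name `etc/config.properties` is an assumption not stated in the talk, and the options without a stated default are left out rather than guessed.

```properties
# Coordinator: etc/config.properties (assumed location)
# Defaults as quoted on the slides; raise them to spread a stage wider.
query.initial-hash-partitions=8
node-scheduler.min-candidates=10

# Worker: etc/config.properties (assumed location)
# Detailed per-task statistics cost some overhead.
task.cpu-timer-enabled=true
# Enlarge if queries fail with "Task exceeded max memory size".
task.max-memory=256MB
```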
  • 57. 5. Roadmap A report of Presto Meetup 2014 http://www.slideshare.net/dain1/presto-meetup-20140514-34731104 "Presto, Past, Present, and Future" by Dain Sundstrom at Facebook
  • 58. Presto’s future > Huge JOIN and GROUP BY > Spill to disk > Task recovery > CREATE VIEW (※implemented) > Native store (※implemented) > Fast data store in Presto workers > to cache hot data > Authentication and permissions
  • 59. Presto’s future > DDL/DML statements > CREATE TABLE with partitioning > DELETE and INSERT > Plugin repository > CLI plugin manager > JOIN and aggregation pushdown > Custom optimizers
  • 60. Links > Web site & document > http://prestodb.io > Mailing list > https://groups.google.com/group/presto-users > Github > https://github.com/facebook/presto > Guidelines for contribution > https://github.com/facebook/presto/blob/master/CONTRIBUTING.md
  • 61. Check: www.treasuredata.com Cloud service for the entire data pipeline, including Presto. We’re hiring!