Copyright  ©2015  Treasure  Data.    All  Rights  Reserved.
20 min. Introduction to Hivemall
Makoto YUI @myui
Research Engineer, Treasure Data Inc.
http://myui.github.io/
TD tech talk #3 @Retty, 2015/05/14
Ø2015/04  Joined  Treasure  Data,  Inc.
Ø1st Research  Engineer  in  Treasure  Data
ØMy  mission  in  TD  is  developing  ML-­‐as-­‐a-­‐Service  (MLaaS)  
Ø2010/04-­‐2015/03  Senior  Researcher  at  National  Institute  
of  Advanced  Industrial  Science  and  Technology,  Japan.  
ØWorked  on  a  large-­‐scale  Machine  Learning  project  and  Parallel  
Databases  
Ø2009/03  Ph.D.  in  Computer  Science  from  NAIST
Ø My  research  topic  was  about  building  XML  native  database  and  
Parallel  Database  systems
ØSuper  programmer  award  from  the  MITOU  Foundation  
(a  Government  founded  program  for  finding  young  and  
talented  programmers)
Ø Super  creators  in  Treasure  Data:  Sada Furuhashi,  Keisuke  Nishida
2
Who  am    I  ?
Figures of Imported Data in Treasure Data
[Figure: imported records by month, Aug 2012 to Oct 2014 (unit: billion records), passing 5 trillion and then 10 trillion records. Annotated milestones: "Service in", Series A funding, reached 100 customers, selected as "Cool Vendor in Big Data" by Gartner]
Figures as of Oct. 2014:
• 400 thousand records imported every second
• 10+ trillion records imported in total
• 12 billion records sent by an ad-tech company
The latest numbers in Treasure Data
• 100+ customers in Japan
• 15 trillion stored records
• 4,000 nodes: a single company sends data to us from 4,000 nodes
• 500,000 records stored per second
Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
What is Hivemall
A scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2.
Check http://github.com/myui/hivemall
[Figure: software stack, top to bottom]
• Machine Learning: Hivemall
• Query Processing: Hive / Pig
• Parallel Data Processing Framework: MapReduce (MRv1); Apache Tez (DAG processing); MR v2
• Resource Management: Apache YARN
• Distributed File System: Hadoop HDFS
MapReduce and DAG engine
[Figure: dataflow of map (M) and reduce (R) tasks between HDFS reads and writes, shown side by side for MapReduce and for a DAG engine (Tez/Spark)]
DAG engine: No intermediate DFS reads/writes!
The key characteristic of Hivemall
Very easy to use; Machine Learning on SQL
Classification with Mahout: 100+ lines of code
With Hivemall:
  CREATE TABLE lr_model AS
  SELECT
    feature, -- reducers perform model averaging in parallel
    avg(weight) as weight
  FROM (
    SELECT logress(features,label,..) as (feature,weight)
    FROM train
  ) t -- map-only task
  GROUP BY feature; -- shuffled to reducers
This SQL query automatically runs in parallel on Hadoop
✓ Machine Learning made easy for SQL developers (ML for the rest of us)
✓ APIs are very stable because of SQL abstraction
List of functions in Hivemall v0.3
• Classification (both binary- and multi-class)
  ✓ Perceptron
  ✓ Passive Aggressive (PA)
  ✓ Confidence Weighted (CW)
  ✓ Adaptive Regularization of Weight Vectors (AROW)
  ✓ Soft Confidence Weighted (SCW)
  ✓ AdaGrad+RDA
• Regression
  ✓ Logistic Regression (SGD)
  ✓ PA Regression
  ✓ AROW Regression
  ✓ AdaGrad
  ✓ AdaDELTA
• kNN and Recommendation
  ✓ Minhash and b-Bit Minhash (LSH variant)
  ✓ Similarity Search using K-NN
  ✓ Matrix Factorization
• Feature engineering
  ✓ Feature hashing
  ✓ Feature scaling (normalization, z-score)
  ✓ TF-IDF vectorizer
Treasure Data will support Hivemall v0.3.1 next week!
bit.ly/hivemall-mf
Hivemall on Apache Pig
• Contribution from Daniel Dai (Pig PMC) of Hortonworks
• To be supported from Pig 0.15
Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
How to use Hivemall
[Figure: machine learning workflow. Training turns feature vectors and labels into a prediction model; prediction applies the model to a feature vector to produce a label. Current step: Data preparation]
How to use Hivemall - Data preparation
Define a Hive table for training/testing data
  CREATE EXTERNAL TABLE e2006tfidf_train (
    rowid int,
    label float,
    features ARRAY<STRING>
  )
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ","
  STORED AS TEXTFILE LOCATION '/dataset/E2006-tfidf/train';
How to use Hivemall
[Figure: machine learning workflow. Training turns feature vectors and labels into a prediction model; prediction applies the model to a feature vector to produce a label. Current step: Feature Engineering]
How to use Hivemall - Feature Engineering
Applying a Min-Max Feature Normalization
  create view e2006tfidf_train_scaled
  as
  select
    rowid,
    -- transforms a label value to a value between 0.0 and 1.0
    rescale(label, ${min_label}, ${max_label}) as label,
    features
  from
    e2006tfidf_train;
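The min-max normalization on this slide can be sketched in plain Python. This is only an illustration of the arithmetic, assuming that rescale(value, min, max) computes (value - min) / (max - min); the function name min_max_rescale and the sample labels are hypothetical and not part of Hivemall.

```python
def min_max_rescale(value: float, min_value: float, max_value: float) -> float:
    """Map a value from [min_value, max_value] onto [0.0, 1.0]."""
    if max_value == min_value:
        # Degenerate range: scaling is undefined, so fail loudly in this sketch.
        raise ValueError("min_value and max_value must differ")
    return (value - min_value) / (max_value - min_value)


# Hypothetical raw label values; min/max play the role of
# ${min_label} and ${max_label} in the view definition.
labels = [2.0, 4.0, 6.0]
lo, hi = min(labels), max(labels)
scaled = [min_max_rescale(v, lo, hi) for v in labels]
print(scaled)  # [0.0, 0.5, 1.0]
```

In the Hive query the minimum and maximum are passed in as variables, so they would typically be computed once over the training table before the view is created.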
How to use Hivemall
[Figure: machine learning workflow. Training turns feature vectors and labels into a prediction model; prediction applies the model to a feature vector to produce a label. Current step: Training]
How to use Hivemall - Training
Training by logistic regression
  CREATE TABLE lr_model AS
  SELECT
    feature,
    avg(weight) as weight -- reducers perform model averaging in parallel
  FROM (
    SELECT logress(features,label,..) as (feature,weight) -- map-only task to learn a prediction model
    FROM train
  ) t
  GROUP BY feature -- shuffle map outputs to reducers by feature
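The model-averaging step (avg(weight) ... GROUP BY feature) can be sketched in plain Python as follows. It assumes each map task emits its own (feature, weight) pairs; the function name average_models and the sample weights are hypothetical, and the sketch says nothing about how logress itself learns the weights.

```python
from collections import defaultdict


def average_models(partial_models):
    """Average per-feature weights across models learned by independent map tasks."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in partial_models:
        for feature, weight in model.items():
            sums[feature] += weight
            counts[feature] += 1
    # Equivalent of: SELECT feature, avg(weight) ... GROUP BY feature
    return {feature: sums[feature] / counts[feature] for feature in sums}


# Hypothetical (feature -> weight) outputs of three map tasks.
mapper_outputs = [
    {"f1": 0.5, "f2": -1.0},
    {"f1": 1.5, "f3": 0.25},
    {"f1": 1.0, "f2": -2.0},
]
lr_model = average_models(mapper_outputs)
print(lr_model)  # {'f1': 1.0, 'f2': -1.5, 'f3': 0.25}
```

As with SQL avg, a feature is averaged only over the map tasks that actually emitted it, not over all map tasks.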
How to use Hivemall - Training
Training of Confidence Weighted Classifier
  CREATE TABLE news20b_cw_model1 AS
  SELECT
    feature,
    -- vote to use negative or positive weights for avg,
    -- e.g. +0.7, +0.3, +0.2, -0.1, +0.7
    voted_avg(weight) as weight
  FROM
    (SELECT
       train_cw(features,label) as (feature,weight) -- training for the CW classifier
     FROM
       news20b_train
    ) t
  GROUP BY feature
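A rough Python sketch of the voting idea, using the five weights shown on the slide. It assumes that the sign with more votes wins and that only the winning weights are averaged; the tie-breaking rule and the name voted_average are assumptions made for illustration, not taken from the Hivemall source.

```python
def voted_average(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: a tie goes to the negative side; the slide does not say.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


votes = [0.7, 0.3, 0.2, -0.1, 0.7]  # the weights shown on the slide
result = voted_average(votes)
print(round(result, 3))
```

Here four of the five weights are positive, so the single negative weight is ignored and the result is roughly the mean of the positive ones (about 0.475), whereas a plain avg would be pulled down by the outlier.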
Ensemble learning for stable prediction performance
Just stack prediction models by union all
  create table news20mc_ensemble_model1 as
  select
    label,
    cast(feature as int) as feature,
    cast(voted_avg(weight) as float) as weight
  from
    (select
       train_multiclass_cw(addBias(features),label) as (label,feature,weight)
     from
       news20mc_train_x3
     union all
     select
       train_multiclass_arow(addBias(features),label) as (label,feature,weight)
     from
       news20mc_train_x3
     union all
     select
       train_multiclass_scw(addBias(features),label) as (label,feature,weight)
     from
       news20mc_train_x3
    ) t
  group by label, feature;
How to use Hivemall
[Figure: machine learning workflow. Training turns feature vectors and labels into a prediction model; prediction applies the model to a feature vector to produce a label. Current step: Prediction]
How to use Hivemall - Prediction
Prediction is done by LEFT OUTER JOIN between test data and prediction model
  CREATE TABLE lr_predict
  as
  SELECT
    t.rowid,
    sigmoid(sum(m.weight)) as prob
  FROM
    testing_exploded t LEFT OUTER JOIN
    lr_model m ON (t.feature = m.feature)
  GROUP BY
    t.rowid
No need to load the entire model into memory
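What the join-based prediction computes can be sketched in plain Python. The sketch assumes testing_exploded holds one (rowid, feature) pair per row, as the join on t.feature suggests; the function names and sample data are hypothetical. The model is a small in-memory dict here purely for illustration, whereas the point of the SQL form is that the join avoids loading the whole model into memory.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """Mimic: SELECT rowid, sigmoid(sum(weight)) ... LEFT OUTER JOIN ... GROUP BY rowid."""
    totals = {}
    for rowid, feature in testing_exploded:
        totals.setdefault(rowid, None)
        weight = model.get(feature)  # None plays the role of SQL NULL
        if weight is not None:
            # sum() skips NULLs, so unmatched features contribute nothing.
            totals[rowid] = (totals[rowid] or 0.0) + weight
    # A row whose features all miss the model stays NULL (None), as in SQL.
    return {rowid: None if total is None else sigmoid(total)
            for rowid, total in totals.items()}


# Hypothetical model weights and exploded test rows.
model = {"f1": 2.0, "f2": -2.0}
testing_exploded = [(1, "f1"), (1, "f2"), (2, "f1"), (2, "f9"), (3, "f9")]
probs = predict(testing_exploded, model)
print(probs)
```

Row 1 sums to 0.0 and gets probability 0.5, row 2 is scored from f1 alone, and row 3 has no known feature and therefore no score.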
Plan of the Talk
1. Brief introduction to Hivemall
2. How to use Hivemall
3. Real-time prediction w/ Hivemall and RDBMS
Type/Purpose Matrix of Machine Learning

                   | Online Learning              | Offline Learning
Online Prediction  | • Algorithm Trade (HFT)      | • Ad-tech (e.g., CTR/CVR prediction)
                   | • Twitter real-time analysis | • Real-time recommendation
Offline Prediction | no/few needs?                | • Daily/weekly batch systems
                   |                              | • Business Analytics/Reporting
How to use Hivemall
[Figure: machine learning workflow split into "Batch Training on Hadoop" and "Online Prediction on RDBMS", connected by "Export prediction model"]
Export Prediction Model to an RDBMS
  hive> desc news20b_cw_model1;
  feature int
  weight double
Sample rows (feature, weight):
  103 -0.4896543622016907
  104 -0.0955817922949791
  105 0.12560302019119263
  106 0.09214721620082855
TD export to any RDBMS
Periodical export is very easy in Treasure Data
Real-time Prediction on MySQL
#2 Preparing a Test data table
  hive> desc testing_exploded;
  feature string
  value float
#3 Online prediction on MySQL
  SELECT
    sigmoid(sum(t.value * m.weight)) as prob
  FROM
    testing_exploded t LEFT OUTER JOIN
    prediction_model m ON (t.feature = m.feature)
where SIGMOID(x) = 1.0 / (1.0 + exp(-x))
• Alternatively, you can define the testing target as a SQL view
• Index lookups are very efficient in RDBMSs
http://bit.ly/hivemall-rtp
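The online-prediction query can be tried end to end with a small Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, which is an assumption made so the example is self-contained; the column types, the primary key on feature, and the sample weights are likewise hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The primary key gives the model table an index for the join lookups.
    CREATE TABLE prediction_model (feature TEXT PRIMARY KEY, weight REAL);
    CREATE TABLE testing_exploded (feature TEXT, value REAL);
""")
# Hypothetical exported weights and the features of one incoming request.
conn.executemany("INSERT INTO prediction_model VALUES (?, ?)",
                 [("103", -0.5), ("104", 0.25), ("105", 1.0)])
conn.executemany("INSERT INTO testing_exploded VALUES (?, ?)",
                 [("103", 1.0), ("105", 2.0), ("999", 1.0)])

(score,) = conn.execute("""
    SELECT sum(t.value * m.weight)
    FROM testing_exploded t
    LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
""").fetchone()

# SIGMOID(x) = 1.0 / (1.0 + exp(-x)), applied client-side because a stock
# SQLite build may not provide exp().
prob = 1.0 / (1.0 + math.exp(-score))
print(score, prob)
conn.close()
```

Feature 999 has no row in the model, so the outer join yields NULL for it and sum() skips it; the score is 1.0 * -0.5 + 2.0 * 1.0 = 1.5.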
Cost of Amazon Machine Learning
• Data Analysis and Model Building Fees: $0.42/Instance per Hour
• Batch Prediction: $0.1/1000 requests
• Real-time Prediction: $0.0001 per request
Amazon-ML is suspected to be based on Vowpal Wabbit (single process)
Pay-per-request is apparently not suitable for doing prediction for each web request (e.g., online CTR prediction)
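A back-of-the-envelope calculation shows why per-request pricing adds up. The price is the real-time figure quoted on the slide; the traffic level of 1,000 requests per second is a hypothetical assumption, as is the function name.

```python
REALTIME_PRICE_PER_REQUEST = 0.0001  # USD per request, as quoted on the slide


def realtime_cost(requests_per_second: float, days: int = 30) -> float:
    """Cost in USD of issuing one real-time prediction per web request."""
    requests = requests_per_second * 60 * 60 * 24 * days
    return requests * REALTIME_PRICE_PER_REQUEST


# Hypothetical traffic level: 1,000 prediction requests per second.
cost = realtime_cost(1_000)
print(f"${cost:,.0f} per 30 days")
```

At that rate the service would handle about 2.6 billion requests in 30 days, which comes to roughly $259,200 at the quoted price.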
Real-time Prediction on Treasure Data
[Figure: pipeline] Run batch training job periodically -> Periodical export -> Real-time prediction on an RDBMS
Beyond Query-as-a-Service!
We ❤️ Open-source! We invented ..
We are Hiring!

ESIT135 Problem Solving Using Python Notes of Unit-2 and Unit-3
prasadmutkule1
 
GREEN BULIDING PPT FOR THE REFRENACE.PPT
GREEN BULIDING PPT FOR THE REFRENACE.PPTGREEN BULIDING PPT FOR THE REFRENACE.PPT
GREEN BULIDING PPT FOR THE REFRENACE.PPT
kamalkeerthan61
 
Turbocor Product and Technology Review.pdf
Turbocor Product and Technology Review.pdfTurbocor Product and Technology Review.pdf
Turbocor Product and Technology Review.pdf
Totok Sulistiyanto
 
GM Meeting 070225 TO 130225 for 2024.pptx
GM Meeting 070225 TO 130225 for 2024.pptxGM Meeting 070225 TO 130225 for 2024.pptx
GM Meeting 070225 TO 130225 for 2024.pptx
crdslalcomumbai
 
PPt physics -GD.pptx gd topic for physics btech
PPt physics -GD.pptx gd topic for physics btechPPt physics -GD.pptx gd topic for physics btech
PPt physics -GD.pptx gd topic for physics btech
kavyamittal2201735
 
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptxMathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
ppkmurthy2006
 
ESIT135 Problem Solving Using Python Notes of Unit-1 and Unit-2
ESIT135 Problem Solving Using Python Notes of Unit-1 and Unit-2ESIT135 Problem Solving Using Python Notes of Unit-1 and Unit-2
ESIT135 Problem Solving Using Python Notes of Unit-1 and Unit-2
prasadmutkule1
 
Introduction to Safety, Health & Environment
Introduction to Safety, Health  & EnvironmentIntroduction to Safety, Health  & Environment
Introduction to Safety, Health & Environment
ssuserc606c7
 
Taykon-Kalite belgeleri
Taykon-Kalite belgeleriTaykon-Kalite belgeleri
Taykon-Kalite belgeleri
TAYKON
 
Wireless-Charger presentation for seminar .pdf
Wireless-Charger presentation for seminar .pdfWireless-Charger presentation for seminar .pdf
Wireless-Charger presentation for seminar .pdf
AbhinandanMishra30
 
Air pollution is contamination of the indoor or outdoor environment by any ch...
Air pollution is contamination of the indoor or outdoor environment by any ch...Air pollution is contamination of the indoor or outdoor environment by any ch...
Air pollution is contamination of the indoor or outdoor environment by any ch...
dhanashree78
 
Soil Properties and Methods of Determination
Soil Properties and  Methods of DeterminationSoil Properties and  Methods of Determination
Soil Properties and Methods of Determination
Rajani Vyawahare
 
Sppu engineering artificial intelligence and data science semester 6th Artif...
Sppu engineering  artificial intelligence and data science semester 6th Artif...Sppu engineering  artificial intelligence and data science semester 6th Artif...
Sppu engineering artificial intelligence and data science semester 6th Artif...
pawaletrupti434
 

Hivemall Talk at TD tech talk #3

  • 1. 20 min. Introduction to Hivemall. Makoto YUI (@myui, http://myui.github.io/), Research Engineer, Treasure Data Inc. 2015/05/14, TD tech talk #3 @Retty. Copyright ©2015 Treasure Data. All Rights Reserved.
  • 2. Who am I?
      - 2015/04: Joined Treasure Data, Inc. as its first Research Engineer. My mission at TD is developing ML-as-a-Service (MLaaS).
      - 2010/04-2015/03: Senior Researcher at the National Institute of Advanced Industrial Science and Technology, Japan. Worked on a large-scale machine learning project and parallel databases.
      - 2009/03: Ph.D. in Computer Science from NAIST. My research topic was building an XML-native database and parallel database systems.
      - Super Programmer award from the MITOU Foundation (a government-funded program for finding young, talented programmers). Super Creators at Treasure Data: Sada Furuhashi, Keisuke Nishida.
  • 3. Figures of Imported Data in Treasure Data. [Chart: imported records per month, in billions, from Aug-12 to Oct-14, annotated with milestones: "Service in", Series A funding, reached 100 customers, selected as "Cool Vendor in Big Data" by Gartner, 5 trillion records, 10 trillion records.] Figures as of Oct. 2014:
      - 400 thousand (40万) records imported every second
      - 10+ trillion (10兆) records imported in total
      - 12 billion (120億) records sent by an ad-tech company
  • 4. The latest numbers in Treasure Data
      - 100+ customers in Japan
      - 15 trillion stored records
      - 4,000 nodes from which a single company sends data to us
      - 500,000 records stored per second
  • 5. Plan of the Talk
      1. Brief introduction to Hivemall
      2. How to use Hivemall
      3. Real-time prediction with Hivemall and an RDBMS
  • 6. What is Hivemall: a scalable machine learning library built on top of Apache Hive, licensed under the Apache License v2. Check http://github.com/myui/hivemall
      [Stack diagram, bottom to top:]
      - Distributed file system: Hadoop HDFS
      - Resource management: Apache YARN
      - Parallel data processing framework: MapReduce (MRv1), MR v2, Apache Tez (DAG processing)
      - Query processing: Hive / Pig
      - Machine learning: Hivemall
  • 7. MapReduce and DAG engine. [Diagram: chained MapReduce jobs, whose map (M) and reduce (R) stages go through HDFS between jobs, next to a DAG engine (Tez/Spark) that connects the M and R tasks directly.] No intermediate DFS reads/writes!
  • 8. The key characteristic of Hivemall: very easy to use; machine learning on SQL. Classification with Mahout: 100+ lines of code. With Hivemall:
      CREATE TABLE lr_model AS
      SELECT
        feature, -- reducers perform model averaging in parallel
        avg(weight) as weight
      FROM (
        SELECT logress(features, label, ..) as (feature, weight)
        FROM train
      ) t -- map-only task
      GROUP BY feature; -- shuffled to reducers
      This SQL query automatically runs in parallel on Hadoop.
      - Machine learning made easy for SQL developers (ML for the rest of us)
      - APIs are very stable because of the SQL abstraction
  • 9. List of functions in Hivemall v0.3
      - Classification (both binary- and multi-class): Perceptron; Passive Aggressive (PA); Confidence Weighted (CW); Adaptive Regularization of Weight Vectors (AROW); Soft Confidence Weighted (SCW); AdaGrad+RDA
      - Regression: Logistic Regression (SGD); PA Regression; AROW Regression; AdaGrad; AdaDELTA
      - kNN and recommendation: Minhash and b-Bit Minhash (LSH variant); similarity search using k-NN; Matrix Factorization
      - Feature engineering: feature hashing; feature scaling (normalization, z-score); TF-IDF vectorizer
      Treasure Data will support Hivemall v0.3.1 next week! bit.ly/hivemall-mf
  • 10. Hivemall on Apache Pig
      - Contributed by Daniel Dai (Pig PMC) of Hortonworks
      - To be supported from Pig 0.15
  • 11. Plan of the Talk
      1. Brief introduction to Hivemall
      2. How to use Hivemall
      3. Real-time prediction with Hivemall and an RDBMS
  • 12. How to use Hivemall: Data preparation. [Workflow diagram: in Training, machine learning turns feature vectors with labels into a prediction model; in Prediction, the model assigns a label to a feature vector. This slide highlights the data preparation step.]
  • 13. How to use Hivemall - Data preparation. Define a Hive table for training/testing data:
      CREATE EXTERNAL TABLE e2006tfidf_train (
        rowid int,
        label float,
        features ARRAY<STRING>
      )
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        COLLECTION ITEMS TERMINATED BY ","
      STORED AS TEXTFILE
      LOCATION '/dataset/E2006-tfidf/train';
  • 14. How to use Hivemall: Feature Engineering. [Same workflow diagram as slide 12, now highlighting the feature engineering step.]
  • 15. How to use Hivemall - Feature Engineering. Applying min-max normalization, which transforms a label value to a value between 0.0 and 1.0:
      CREATE VIEW e2006tfidf_train_scaled AS
      SELECT
        rowid,
        rescale(label, ${min_label}, ${max_label}) as label,
        features
      FROM e2006tfidf_train;
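As a rough illustration of the min-max rescaling on slide 15, the following plain-Python sketch maps a label into the range 0.0 to 1.0. The function name `rescale` mirrors the one in the query, but the body, including the handling of a zero-width range, is an assumption about its behavior rather than Hivemall's actual implementation, and the sample labels are made up.

```python
def rescale(value: float, min_value: float, max_value: float) -> float:
    """Min-max normalization: map value from [min_value, max_value] to [0.0, 1.0]."""
    if max_value == min_value:
        # Assumption: a zero-width range maps to the midpoint instead of dividing by zero.
        return 0.5
    return (value - min_value) / (max_value - min_value)


# Hypothetical labels; in the slide, the bounds come from ${min_label} and ${max_label}.
labels = [-7.9, -3.6, -0.5]
lo, hi = min(labels), max(labels)
scaled = [rescale(v, lo, hi) for v in labels]
```

The smallest label maps to 0.0 and the largest to 1.0, so a regressor trained on the view sees targets on a fixed scale.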
  • 16. How to use Hivemall: Training. [Same workflow diagram as slide 12, now highlighting the training step.]
  • 17. How to use Hivemall - Training. Training by logistic regression:
      CREATE TABLE lr_model AS
      SELECT
        feature,
        avg(weight) as weight -- reducers perform model averaging in parallel
      FROM (
        SELECT logress(features, label, ..) as (feature, weight) -- map-only task to learn a prediction model
        FROM train
      ) t
      GROUP BY feature -- shuffle map outputs to reducers by feature
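The training query on slide 17 can be read as: each map task learns its own weights, and the reducers average them per feature. The sketch below mimics that flow in plain Python. `train_partition` is a hypothetical stand-in for `logress` (plain SGD on the logistic loss, with an assumed learning rate `eta` and epoch count), not Hivemall's implementation, and the toy data is invented.

```python
import math
from collections import defaultdict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def train_partition(rows, eta=0.1, epochs=20):
    """Hypothetical stand-in for one map task: SGD for logistic regression.

    rows: list of (features, label) where features is {feature_name: value}
    and label is 0.0 or 1.0. Returns a list of (feature, weight) pairs.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in rows:
            z = sum(weights[f] * v for f, v in features.items())
            gradient = label - sigmoid(z)
            for f, v in features.items():
                weights[f] += eta * gradient * v
    return list(weights.items())


def average_models(partition_outputs):
    """Reducer side: GROUP BY feature, then avg(weight)."""
    grouped = defaultdict(list)
    for pairs in partition_outputs:
        for feature, weight in pairs:
            grouped[feature].append(weight)
    return {f: sum(ws) / len(ws) for f, ws in grouped.items()}


# Two toy partitions, as if the training table were split across two mappers.
part1 = [({"a": 1.0}, 1.0), ({"b": 1.0}, 0.0)]
part2 = [({"a": 1.0, "c": 1.0}, 1.0), ({"b": 1.0}, 0.0)]
lr_model = average_models([train_partition(part1), train_partition(part2)])
```

A feature seen by only one mapper (here `c`) is averaged over that single weight, which matches what `GROUP BY feature` with `avg` does.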
  • 18. How to use Hivemall - Training. Training of the Confidence Weighted (CW) classifier:
      CREATE TABLE news20b_cw_model1 AS
      SELECT
        feature,
        voted_avg(weight) as weight
      FROM (
        SELECT train_cw(features, label) as (feature, weight)
        FROM news20b_train
      ) t
      GROUP BY feature
      voted_avg votes on whether to use the negative or the positive weights for the average, e.g., for +0.7, +0.3, +0.2, -0.1, +0.7.
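One plausible reading of the `voted_avg` callout on slide 18 is a majority vote on the sign, followed by an average over the winning side only. The Python sketch below implements that reading. The tie-breaking rule (falling back to the negative side) is an assumption, and the function is an illustration rather than Hivemall's code.

```python
def voted_avg(weights):
    """Average only the weights whose sign wins the majority vote."""
    positives = [w for w in weights if w > 0]
    negatives = [w for w in weights if w < 0]
    # Assumption: on a tie, or when there are no positive weights, use the negative side.
    winners = positives if len(positives) > len(negatives) else negatives
    if not winners:
        return 0.0
    return sum(winners) / len(winners)


# The example weights from the slide: four positive votes against one negative.
result = voted_avg([+0.7, +0.3, +0.2, -0.1, +0.7])  # about 0.475; the -0.1 is ignored
```

A plain average over all five weights would be pulled down to about 0.36 by the single dissenting value.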
  • 19. Ensemble learning for stable prediction performance. Just stack prediction models by union all:
      create table news20mc_ensemble_model1 as
      select
        label,
        cast(feature as int) as feature,
        cast(voted_avg(weight) as float) as weight
      from (
        select train_multiclass_cw(addBias(features), label) as (label, feature, weight)
        from news20mc_train_x3
        union all
        select train_multiclass_arow(addBias(features), label) as (label, feature, weight)
        from news20mc_train_x3
        union all
        select train_multiclass_scw(addBias(features), label) as (label, feature, weight)
        from news20mc_train_x3
      ) t
      group by label, feature;
  • 20. How to use Hivemall: Prediction. [Same workflow diagram as slide 12, now highlighting the prediction step.]
  • 21. How to use Hivemall - Prediction. Prediction is done by a LEFT OUTER JOIN between the test data and the prediction model, so there is no need to load the entire model into memory:
      CREATE TABLE lr_predict AS
      SELECT
        t.rowid,
        sigmoid(sum(m.weight)) as prob
      FROM testing_exploded t
      LEFT OUTER JOIN lr_model m ON (t.feature = m.feature)
      GROUP BY t.rowid
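The join-based prediction on slide 21 can be mimicked outside Hive as follows. This Python sketch treats the model as a lookup table and the exploded test data as (rowid, feature) rows, both made up for illustration. Features missing from the model contribute nothing to the sum, which is how SQL's sum() treats the NULLs produced by a LEFT OUTER JOIN; `predict` is a hypothetical helper name.

```python
import math
from collections import defaultdict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(testing_exploded, model):
    """Emulate: SELECT rowid, sigmoid(sum(weight)) ... LEFT OUTER JOIN ... GROUP BY rowid."""
    matched = defaultdict(list)
    for rowid, feature in testing_exploded:
        weights = matched[rowid]  # every rowid survives a LEFT OUTER JOIN
        if feature in model:
            weights.append(model[feature])
    # SQL sum() skips NULLs and yields NULL when nothing matched;
    # this sketch passes that case through as None.
    return {rowid: sigmoid(sum(ws)) if ws else None for rowid, ws in matched.items()}


# Hypothetical model and test rows; feature 999 is absent from the model.
lr_model = {103: -0.49, 104: -0.10, 105: 0.13}
testing_exploded = [(1, 103), (1, 104), (2, 105), (2, 999), (3, 999)]
lr_predict = predict(testing_exploded, lr_model)
```

Only the model rows that match a test feature are ever touched, which is the point the slide makes about not loading the whole model into memory.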
  • 22. Plan of the Talk
      1. Brief introduction to Hivemall
      2. How to use Hivemall
      3. Real-time prediction with Hivemall and an RDBMS
  • 23. Type/Purpose Matrix of Machine Learning
      - Online learning, online prediction: algorithmic trading (HFT); Twitter real-time analysis
      - Offline learning, online prediction: ad-tech (e.g., CTR/CVR prediction); real-time recommendation
      - Online learning, offline prediction: no/few needs?
      - Offline learning, offline prediction: daily/weekly batch systems; business analytics/reporting
  • 24. How to use Hivemall: Export prediction model. [Workflow diagram: batch training on Hadoop turns feature vectors with labels into a prediction model; the model is exported for online prediction on an RDBMS, where it assigns a label to a feature vector.]
  • 25. Export Prediction Model to an RDBMS. Periodical export to any RDBMS is very easy in Treasure Data (TD export).
      hive> desc news20b_cw_model1;
      feature  int
      weight   double
      Sample rows (feature, weight):
      103  -0.4896543622016907
      104  -0.0955817922949791
      105  0.12560302019119263
      106  0.09214721620082855
  • 26. Real-time Prediction on MySQL (http://bit.ly/hivemall-rtp)
      #2 Preparing a test data table. You can alternatively define the testing target as a SQL view.
      hive> desc testing_exploded;
      feature  string
      value    float
      #3 Online prediction on MySQL, where SIGMOID(x) = 1.0 / (1.0 + exp(-x)). Index lookups are very efficient in RDBMSs.
      SELECT
        sigmoid(sum(t.value * m.weight)) as prob
      FROM testing_exploded t
      LEFT OUTER JOIN prediction_model m ON (t.feature = m.feature)
      [Diagram: the prediction model assigns a label to a feature vector.]
  • 27. Cost of Amazon Machine Learning
      - Data analysis and model building fees: $0.42 per instance per hour
      - Batch prediction: $0.1 per 1,000 requests
      - Real-time prediction: $0.0001 per request
      Amazon ML is suspected to be based on Vowpal Wabbit (single process). Pay-per-request pricing is apparently not suitable for making a prediction on each web request (e.g., online CTR prediction).
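To see how pay-per-request pricing adds up, here is a back-of-the-envelope calculation in Python using the real-time price quoted on slide 27. The rate of 1,000 predictions per second is an assumed figure chosen for illustration, not a number from the talk.

```python
PRICE_PER_REALTIME_REQUEST_USD = 0.0001  # real-time prediction price from the slide
ASSUMED_REQUESTS_PER_SECOND = 1_000      # hypothetical traffic level
SECONDS_PER_DAY = 24 * 60 * 60


def daily_cost_usd(requests_per_second: float) -> float:
    """Cost of issuing one real-time prediction per web request for a full day."""
    return requests_per_second * SECONDS_PER_DAY * PRICE_PER_REALTIME_REQUEST_USD


cost = daily_cost_usd(ASSUMED_REQUESTS_PER_SECOND)
print(f"about ${cost:,.0f} per day")
```

At the assumed rate this comes to roughly $8,640 per day, before any training or batch fees.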
  • 28. Real-time Prediction on Treasure Data: run a batch training job periodically, export the model periodically, and perform real-time prediction on an RDBMS.
  • 29. Beyond Query-as-a-Service! We ❤️ open source! We invented .. We are hiring!