Oedo Ruby Conference 04: Is It Wrong to Talk About SQL at a Ruby Conference?

Is It Wrong to Talk About SQL at a Ruby Conference?
Minero Aoki

Theme of this session

"I'd like a technically deep topic."
Akira Matsuda said he expects a deep technical talk from me.

What is a "deep talk"???

Talking about the Ruby implementation is not particularly deep anymore,
so I will speak about another theme.

Big Data Analytics in 25 Minutes
~In Memory of MapReduce~

Suppose you must analyze about 100 TB of text data in total

A single CPU just cannot handle 100 TB…
Computer
Program
Data

So let's go distributed: you need more computers (distributed processing)
Node 0    Node 1    Node 2    Node 3
Program   Program   Program   Program
Data      Data      Data      Data

But distributed processing is a pain… it is just too difficult…

That is where a parallel RDB comes in: it may help you.

Parallel RDB
Node 0      Node 1      Node 2      Node 3
Front End   Front End   Front End   Front End
Back End    Back End    Back End    Back End

Why a parallel RDB is great
1. Adding nodes makes it linearly faster (linear scalability)!
2. You can use standard SQL!
3. To clients it looks like a single machine!
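
As a rough illustration of the three properties above, here is a toy scatter-gather sketch. It is not how any particular product works: `ToyParallelDb`, `insert`, and `countWhere` are hypothetical names, the "nodes" are just in-memory lists, and rows are plain strings. Rows are hash-partitioned across the nodes, each back end counts its own partition, and the front end sums the partial counts so the caller sees one answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of a parallel RDB: hypothetical names, in-memory only.
class ToyParallelDb {
    // One list of rows per node (standing in for the back ends).
    private final List<List<String>> partitions = new ArrayList<>();

    ToyParallelDb(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Hash-partition each row so that the data is spread across the nodes.
    void insert(String row) {
        int node = Math.floorMod(row.hashCode(), partitions.size());
        partitions.get(node).add(row);
    }

    // Front end: send the predicate to every node, then add up the partial counts.
    // Roughly corresponds to: select count(*) from t where <predicate>
    long countWhere(Predicate<String> predicate) {
        return partitions.parallelStream()
                .mapToLong(p -> p.stream().filter(predicate).count())
                .sum();
    }
}
```

The caller only ever talks to `countWhere`, which is the "looks like one computer" property; each partition is scanned independently, which is why adding nodes can shorten the scan.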

Parallel RDB is super awesome
(hype, hype!)

Various commercial parallel RDBs
Database                        Vendor      Since
Teradata                        Teradata    1983
Teradata Aster                  Teradata    2005
PureData System for Analytics   IBM         2000
Exadata                         Oracle      2008
Greenplum                       Pivotal     2003
SQL Server PDW                  Microsoft   around 2010
Redshift                        Amazon      2012

Hadoop Architecture
HDFS: Distributed File System
MapReduce: Compute Framework
(Hive: SQL interface)

Features of Hadoop
1. Data lives in plain files, not tables
2. Processing uses (or rather, used) MapReduce
3. With Hive on top, you can also write queries in an SQL-like language

Everyone and their cat: MapReduce
Big Data meant MapReduce a few years ago

MapReduce
Input     --Map-->   Intermediate   --Reduce-->   Output
k1, v1               k'1, v'1                     k''1, v''1
k2, v2               k'1, v'2
k3, v3               k'1, v'3
k4, v4               k'2, v'4                     k''2, v''2
k5, v5               k'3, v'5                     k''3, v''3
k6, v6               k'3, v'6

A framework that takes care of the distribution for you
once you write a Map function and a Reduce function.
You just write the Map and Reduce functions; Hadoop handles the rest.
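
To make the division of labor concrete, here is a single-process sketch of what "the rest" amounts to: grouping the pairs emitted by Map by key and handing each group to Reduce. This is an assumption-laden toy, not Hadoop's API: `MiniMapReduce`, `run`, and `wordCount` are hypothetical names, and nothing here is actually distributed.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Single-process stand-in for the framework; all names are hypothetical.
class MiniMapReduce {
    // Runs map -> shuffle (group by key) -> reduce over the given records.
    static <I, K extends Comparable<K>, V, O> Map<K, O> run(
            List<I> records,
            Function<I, List<Map.Entry<K, V>>> mapFn,
            BiFunction<K, List<V>, O> reduceFn) {
        // Map phase plus shuffle: collect every emitted value under its key.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I record : records) {
            for (Map.Entry<K, V> pair : mapFn.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }
        // Reduce phase: one call per distinct key.
        Map<K, O> result = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            result.put(group.getKey(), reduceFn.apply(group.getKey(), group.getValue()));
        }
        return result;
    }

    // Word count expressed as one Map function and one Reduce function.
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map: emit (word, 1) for every whitespace-separated token.
        Function<String, List<Map.Entry<String, Integer>>> mapFn = line -> {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
            return pairs;
        };
        // Reduce: sum the ones collected for each word.
        BiFunction<String, List<Integer>, Integer> reduceFn =
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();
        return run(lines, mapFn, reduceFn);
    }
}
```

Only `mapFn` and `reduceFn` are specific to the job; `run` is the part a real framework would spread across machines.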

Q1.
SQL or MapReduce: which is better?!

The business answer:
SQL

Why SQL?
1. Many people can write SQL but cannot write Java
2. Existing applications that rely on SQL will not run on MapReduce
3. Writing MapReduce functions is costly (it takes more time)

How big is the cost difference?

WordCount with MapReduce:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Map: emit (word, 1) for every whitespace-separated token in the line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Job setup: wire the mapper, combiner, reducer, and input/output paths.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

WordCount with SQL:

select word, count(*)
from (
    select regexp_split_to_table(str, '\s+') as word
    from text_table
) t
group by word;

And in fact, SQL won.
SQL now beats MapReduce.

Q2.
Hadoop or parallel RDB: which is better?!!

For speed, parallel RDB
For data structure, Hadoop
Parallel RDB is faster; Hadoop is more flexible

A typical stack today
HDFS
MapReduce
Hive

The stack going forward
HDFS
Impala backend
Impala frontend

Hadoop is coming to resemble a parallel RDB
Parallel RDB       Hadoop
DB filesystem      HDFS
backend            Impala backend
parser, planner    Impala frontend

Hybrid DBs are coming
in the near future

Q3.
Is MapReduce dead?

No.
Not yet… it is not over yet!!

MapReduce lets you slot Java or C code
into the parallel processing.
MapReduce has better extensibility.

You can call MapReduce from SQL: the two can be combined.

select count(distinct user_id)
from npath(
    on clicks
    partition by user_id
    order by timestamp
    mode(overlapping)
    pattern('H.S.P')
    symbols(
        page_type = 'home' as H,
        page_type = 'search' as S,
        page_type = 'product' as P)
    result(first(user_id of H) as user_id)
);

Recently, npath was also added to Hive (as MatchPath).
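
For readers unfamiliar with npath, the following sketch shows the kind of per-partition logic such a query asks for, written as ordinary Java. It is only an in-memory approximation and not the Aster or Hive implementation: `PathMatchSketch`, `Click`, and `countUsersWithHomeSearchProduct` are hypothetical names, and the column names simply mirror the query above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the npath query; not a real npath implementation.
class PathMatchSketch {
    // One row of the assumed clicks table.
    static final class Click {
        final String userId;
        final long timestamp;
        final String pageType;

        Click(String userId, long timestamp, String pageType) {
            this.userId = userId;
            this.timestamp = timestamp;
            this.pageType = pageType;
        }
    }

    // Counts distinct users whose clicks contain home, search, product as consecutive rows.
    static int countUsersWithHomeSearchProduct(List<Click> clicks) {
        // partition by user_id
        Map<String, List<Click>> byUser = new TreeMap<>();
        for (Click c : clicks) {
            byUser.computeIfAbsent(c.userId, k -> new ArrayList<>()).add(c);
        }
        Set<String> matched = new HashSet<>();
        for (Map.Entry<String, List<Click>> entry : byUser.entrySet()) {
            List<Click> path = entry.getValue();
            // order by timestamp
            path.sort(Comparator.comparingLong((Click c) -> c.timestamp));
            // pattern H.S.P: look for three consecutive rows of the right types
            for (int i = 0; i + 2 < path.size(); i++) {
                if ("home".equals(path.get(i).pageType)
                        && "search".equals(path.get(i + 1).pageType)
                        && "product".equals(path.get(i + 2).pageType)) {
                    matched.add(entry.getKey());
                    break; // one match is enough for count(distinct user_id)
                }
            }
        }
        return matched.size();
    }
}
```

Because each user's clicks are processed independently, the per-user loop is the part that can run in parallel on different nodes, which is what makes this kind of function a natural fit for MapReduce.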

Easy & Handy SQL
+
Extendable MapReduce

Good things are good:
great products exist everywhere

but knowledge is unevenly distributed

OSS World Enterprise World
Hadoop
Ruby
Python Excel
Parallel RDB
Windows
markdown VB
Git
Java
