5. / 76
Question to the audience
[Audience poll: four options A-D, counts shown as "x people" each, crossing
Spark experience (yes/no) with SQL experience (yes/no)]
• "SQL experience" here means you understand what the SELECT, FROM, and
WHERE clauses of SQL mean
• "Spark experience" here means you have run a Spark job in some way
(spark-submit, spark-shell, ...)
12. / 76
general execution graphs?
http://spark.apache.org/docs/latest/
It provides high-level APIs in Java, Scala, Python and R, and an
optimized engine that supports general execution graphs.
15. / 76
general-purpose cluster computing system??
http://spark.apache.org/docs/latest/
Apache Spark is a fast and general-purpose cluster computing system.
26. / 76
fast???
• Comparison with engines other than MapReduce
• Benchmark of scale-out SQL engines:
Presto vs Spark SQL vs Hive on Tez
http://recruit.gmo.jp/engineer/jisedai/blog/presto_spark_hive/
Spark and Tez perform about the same;
Presto is roughly 3x faster.
27. / 76
fast???
• Comparison with engines other than MapReduce
• Benchmark of the PageRank algorithm:
Mahout (MapReduce) vs GraphX (Spark) vs PowerGraph (GraphLab)
http://www.cl.cam.ac.uk/~ey204/teaching/ACS/R212_2013_2014/presentation/S8/Niko_GraphLabx.pdf
Faster than MapReduce,
but slower than GraphLab.
29. / 76
a rich set of higher-level tools
http://spark.apache.org/docs/latest/
It also supports a rich set of higher-level tools including Spark SQL for
SQL and structured data processing, MLlib for machine learning,
GraphX for graph processing, and Spark Streaming.
30. / 76
a rich set of higher-level tools
• Major Spark components and their competing software

Field               Spark component   Other software
SQL                 Spark SQL         Presto, Hive on Tez, Redshift, BigQuery, ...
Machine learning    MLlib             TensorFlow, Torch, MXNet, ...
Graph processing    GraphX            GraphLab, Titan, Microsoft Graph Engine, ...
Stream processing   Spark Streaming   Apache Flink, Apache Storm, ...

Viewed field by field, none of these components offers standout features.
45. / 76
Practical SQL: ABC analysis
• Aggregating monthly sales

WITH
monthly_sales AS (
  SELECT item, SUM(amount) AS amount
  FROM purchase_log
  WHERE
    dt BETWEEN '2017-03-01' AND '2017-03-31'
  GROUP BY item
)
SELECT * FROM monthly_sales;

item          amount
Onion         185,000
Tomato        168,000
Green Onion   130,000
Eggplant       98,000
…             …
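The filter-group-sum step above can be sketched outside SQL as well. A minimal Python rendering with hypothetical in-memory rows (chosen so that Onion sums to the 185,000 shown in the result; the other rows are made up):

```python
from collections import defaultdict

# Hypothetical rows standing in for the purchase_log table: (item, amount, dt)
purchase_log = [
    ("Onion", 90_000, "2017-03-05"),
    ("Onion", 95_000, "2017-03-20"),
    ("Tomato", 168_000, "2017-03-12"),
    ("Eggplant", 98_000, "2017-04-02"),  # outside March: filtered out
]

# WHERE dt BETWEEN '2017-03-01' AND '2017-03-31' / GROUP BY item / SUM(amount)
monthly_sales = defaultdict(int)
for item, amount, dt in purchase_log:
    if "2017-03-01" <= dt <= "2017-03-31":
        monthly_sales[item] += amount
```

String comparison works for the date filter here only because the dates are zero-padded ISO strings, which is also why BETWEEN on text dates behaves as expected in the SQL.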
46. / 76
Practical SQL: ABC analysis
• Computing the composition ratio and cumulative composition ratio

WITH
monthly_sales AS ( ... )
, sales_composition_ratio AS (
  SELECT item, amount
  , 100.0 * amount
      / NULLIF(SUM(amount) OVER(), 0)
    AS comp_ratio
  , 100.0 * SUM(amount) OVER(ORDER BY amount DESC)
      / NULLIF(SUM(amount) OVER(), 0)
    AS cum_ratio
  FROM
    monthly_sales
)
SELECT * FROM sales_composition_ratio;

item          amount    comp_ratio   cum_ratio
Onion         185,000   19.331       19.331
Tomato        168,000   17.554       36.886
Green Onion   130,000   13.584       50.470

Window functions compute the overall and cumulative sales;
NULLIF turns a zero denominator into NULL, avoiding division by zero.
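The two window functions reduce to a grand total and a running total over the amount-descending order. A Python sketch of the same arithmetic, using only the three hypothetical rows shown above (so the percentages differ from the slide, which aggregates the full dataset):

```python
# Hypothetical monthly totals, matching the first rows of the result above.
monthly_sales = [("Onion", 185_000), ("Tomato", 168_000), ("Green Onion", 130_000)]

# SUM(amount) OVER() is the grand total of the group.
total = sum(amount for _, amount in monthly_sales)

# SUM(amount) OVER(ORDER BY amount DESC) is a running total in
# amount-descending order.
rows = []
cumulative = 0
for item, amount in sorted(monthly_sales, key=lambda r: -r[1]):
    cumulative += amount
    rows.append((item, 100.0 * amount / total, 100.0 * cumulative / total))
```

For the top row comp_ratio and cum_ratio coincide, and the last row's cum_ratio is always 100, which is what makes cum_ratio usable for ABC bucketing on the next slide.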
47. / 76
Practical SQL: ABC analysis
• Assigning the ABC rank

WITH
monthly_sales AS ( ... )
, sales_composition_ratio AS ( ... )
SELECT
  item
, cum_ratio
, CASE
    WHEN cum_ratio BETWEEN 0 AND 70 THEN 'A'
    WHEN cum_ratio BETWEEN 70 AND 90 THEN 'B'
    WHEN cum_ratio BETWEEN 90 AND 100 THEN 'C'
  END AS abc_rank
FROM sales_composition_ratio;

item     cum_ratio   abc_rank
Onion    19.331      A
Tomato   36.886      A
…        …           …
Carrot   74.721      B
…        …           …

A CASE expression maps the value to a rank.
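The BETWEEN ranges overlap at their boundaries (70 and 90), but CASE evaluates branches in order and takes the first match, so a cum_ratio of exactly 70 gets rank 'A'. A small Python equivalent of that branch logic:

```python
def abc_rank(cum_ratio):
    # Branches are checked in order, mirroring SQL CASE: a boundary value
    # such as exactly 70 falls into the first matching branch ('A').
    if 0 <= cum_ratio <= 70:
        return "A"
    if 70 <= cum_ratio <= 90:
        return "B"
    if 90 <= cum_ratio <= 100:
        return "C"
    return None  # SQL CASE without ELSE yields NULL
```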
49. / 76
Practical SQL: search log analysis
• Attaching the action that follows each search action

WITH
search_log_with_next_action AS (
  SELECT action, keyword, result_num
  , LEAD(action) OVER(PARTITION BY session ORDER BY stamp) AS next_action
  , LEAD(keyword) OVER(PARTITION BY session ORDER BY stamp) AS next_keyword
  , LEAD(result_num) OVER(PARTITION BY session ORDER BY stamp) AS next_result_num
  FROM search_log
)
SELECT * FROM search_log_with_next_action WHERE action = 'search';

action   keyword    result_num   next_action   next_keyword   next_result_num
Search   Pstgres    0            Search        Postgres       889               ← retried search
Search   Postgres   889          Click                                          ← click
Search   Hive       864                                                         ← exit
…        …          …

The LEAD window function fetches values from the following record.
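LEAD over a session partition amounts to pairing each row with the next row of the same session, with no partner for the last row (an exit). A Python sketch with hypothetical log rows that reproduce the three cases in the table above:

```python
from itertools import groupby

# Hypothetical search_log rows: (session, stamp, action, keyword, result_num)
search_log = [
    ("s1", 1, "search", "Pstgres", 0),
    ("s1", 2, "search", "Postgres", 889),
    ("s1", 3, "click", None, None),
    ("s2", 1, "search", "Hive", 864),
]

# LEAD(...) OVER(PARTITION BY session ORDER BY stamp): within each session,
# pair every row with the row that follows it (None for the last row).
with_next = []
ordered = sorted(search_log, key=lambda r: (r[0], r[1]))
for _, rows in groupby(ordered, key=lambda r: r[0]):
    rows = list(rows)
    for current, nxt in zip(rows, rows[1:] + [None]):
        with_next.append((current, nxt))

# WHERE action = 'search'
searches = [(cur, nxt) for cur, nxt in with_next if cur[2] == "search"]
```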
50. / 76
WITH
search_log_with_next_action AS ( ... )
SELECT
keyword
, COUNT(1) AS search_count
, AVG(CASE WHEN next_action = 'search' THEN 1.0 ELSE 0.0 END) AS retry_rate
, AVG(CASE WHEN next_action = 'click' THEN 1.0 ELSE 0.0 END) AS click_rate
, AVG(CASE WHEN next_action IS NULL THEN 1.0 ELSE 0.0 END) AS exit_rate
FROM search_log_with_next_action
WHERE action = 'search'
GROUP BY keyword;
実践SQL:検索ログ分析
• クリック率、再検索率、離脱率を同時に計算する
50
keyword search_count retry_rate click_rate exit_rate
Postgres 1082 0.2145 0.1548 0.6307
… … …
0除算の考慮
は不要
CASE式で0, 1 のフラグ
に変換し、AVGで集約
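The AVG-of-flags idiom is just the mean of a 0/1 indicator, and since every group contains at least one row, the denominator can never be zero. A Python sketch with hypothetical next_action values for one keyword:

```python
# Hypothetical next_action values for one keyword's search rows (None = exit).
next_actions = ["search", "click", None, None]

# AVG(CASE WHEN ... THEN 1.0 ELSE 0.0 END): the mean of 0/1 flags is a rate.
def rate(values, predicate):
    return sum(1.0 if predicate(v) else 0.0 for v in values) / len(values)

retry_rate = rate(next_actions, lambda v: v == "search")
click_rate = rate(next_actions, lambda v: v == "click")
exit_rate = rate(next_actions, lambda v: v is None)
```

Because the three predicates partition the possible next actions, the three rates always sum to 1 for each keyword.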
69. / 76
Practical Spark SQL: ABC analysis
• Defining a UDF

// UDF that assigns the rank for ABC analysis
val abcUdf = udf { r: Double =>
  r match {
    case r if (0 <= r && r <= 70) => "A"
    case r if (70 < r && r <= 90) => "B"
    case r if (90 < r && r <= 100) => "C"
    case _ => ""
  }
}
71. / 76
Practical Spark SQL: ABC analysis
• Comparison with the SQL version
WITH
sales_composition_ratio AS (
SELECT item, amount
, 100.0 * amount / NULLIF(SUM(amount) OVER(), 0) AS comp_ratio
, 100.0 * SUM(amount) OVER(ORDER BY amount DESC)
/ NULLIF(SUM(amount) OVER(), 0) AS cum_ratio
FROM monthly_sales
)
SELECT item, cum_ratio
, CASE
WHEN cum_ratio BETWEEN 0 AND 70 THEN 'A'
WHEN cum_ratio BETWEEN 70 AND 90 THEN 'B'
WHEN cum_ratio BETWEEN 90 AND 100 THEN 'C'
END AS abc_rank
FROM sales_composition_ratio;
72. / 76
Practical Spark SQL: ABC analysis
• Tests and assertions can be embedded right in the logic

// Data that contains an amount of 0
val malMonthlySales = Seq(
  Sales("Onion", 185000),
  Sales("Tomato", 168000),
  Sales("Green Onion", 0)
).toDS()

// Assert that no record with amount = 0 exists
assert(malMonthlySales.filter("amount = 0").count == 0, "Amount must not be 0")

java.lang.AssertionError: assertion failed: Amount must not be 0
  at scala.Predef$.assert(Predef.scala:170)
  ... 48 elided
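The same fail-fast pattern works in any host language: count the malformed records and assert the count is zero before the pipeline continues. A plain-Python sketch with the same hypothetical data:

```python
# Hypothetical data containing a malformed (amount == 0) record.
mal_monthly_sales = [("Onion", 185_000), ("Tomato", 168_000), ("Green Onion", 0)]

# Guard: no record with amount == 0 may exist.
zero_count = sum(1 for _, amount in mal_monthly_sales if amount == 0)
try:
    assert zero_count == 0, "Amount must not be 0"
    failed = False
except AssertionError:
    failed = True  # the guard fires, as in the Spark output above
```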