Ad click-through rate prediction for KDD'12 track2 using Hive/Pig

Makoto Yui m.yui@aist.go.jp
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST)
Twitter ID: @myui
Slides
http://www.slideshare.net/myui/dsirnlp-myuilt
http://goo.gl/Ulf3A

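KDDcup 2012 track2
• Task: estimate the click-through rate (CTR) of search-engine ads from search logs
– Real data from soso.com, one of the three major search engines in China
• Query terms and similar fields are all converted to numbers, e.g. with hash values
– Training data (about 10GB+2.2GB, 150 million records)
– Test data (about 1.3GB, 20 million records)
• The evaluation dataset is about 13.3% the size of the training data
– The submission format is the predicted CTR
• More a regression problem than classification (though it can of course be solved as classification)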


Table layout of the training data
Training table: UserID, AdID, QueryID, Depth, Position, Impression, Click
  Click = Positive
  Impression - Click = Negative
  CTR = Click / Impression
AdID properties: DisplayURL, AdvertiserID, KeywordID, TitleID, DescriptionID
User table: UserID, Gender, Age
Query table: QueryID, Tokens
Keyword table: KeywordID, Tokens
Title table: TitleID, Tokens
Description table: DescriptionID, Tokens

The evaluation table contains the features other than impression and click
Basically everything is a categorical variable → decompose into binary features

Label  A  B        Label  A:1  A:2  A:3  B:7  B:8  B:9
  1    1  9   →      1     1    0    0    0    0    1
 -1    2  7   →     -1     0    1    0    1    0    0
  1    3  8   →      1     0    0    1    0    1    0
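
The decomposition of categorical columns into binary features can be sketched in Python as follows. This is only an illustration: the function name `binarize` and the `column:value` string encoding are assumptions, not the code used for the competition.

```python
def binarize(row):
    """Turn {column: categorical value} into 'column:value' binary feature names."""
    return [f"{col}:{val}" for col, val in row.items()]

# The three example rows: (Label, {column: value}).
rows = [
    (1, {"A": 1, "B": 9}),
    (-1, {"A": 2, "B": 7}),
    (1, {"A": 3, "B": 8}),
]
examples = [(label, binarize(cols)) for label, cols in rows]

# Dense 0/1 view, for display only; training would keep the sparse lists.
vocabulary = sorted({f for _, feats in examples for f in feats})
dense = [[int(f in feats) for f in vocabulary] for _, feats in examples]
```

Keeping only the names of the active features (the sparse form) is what makes tens of millions of distinct features manageable.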

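Predicting occurrence with logistic regression
• A method for predicting the probability that an event occurs
• Compute how strongly each variable influences the outcome (Train)
– Input: Label, Array<feature>
– Output: per-feature weights as Map<feature, float>
– # of features = 54,686,452
• Note that the token tables are not used (Token ID = <token,..,token>)
• Compute the occurrence probability from those influences (Predict)
– P(X) = Pr(Y=1|x1,x2,..,xn)
– We want to derive a function f: X → Y
s.t. the empirical loss is minimized:
$\arg\min_{w} \frac{1}{n} \sum_{i=0}^{n} loss(f(x_i; w), y_i)$
where w holds the weight of each feature
• Use gradient descent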
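
A minimal Python sketch of the Predict step and of a per-example loss, assuming binary features (so w·x is just the sum of the active features' weights), labels in {-1, +1}, and the standard logistic loss. The weight values are made up for illustration.

```python
import math

def predict(weights, features):
    """Pr(Y=1 | x): sigmoid of the summed weights of the active binary features."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))

def logistic_loss(weights, features, label):
    """Logistic loss log(1 + exp(-y * w.x)) for a label y in {-1, +1}."""
    score = sum(weights.get(f, 0.0) for f in features)
    return math.log(1.0 + math.exp(-label * score))

weights = {"A:1": 0.5, "B:9": -0.5, "B:7": 1.0}  # illustrative values only
p = predict(weights, ["A:1", "B:9"])              # the two weights cancel out
```

Features missing from the map contribute a weight of 0, so unseen feature values leave the score unchanged.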

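Gradient Descent

$w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla loss(f(x_i; w^{(t)}), y_i)$

$w^{(t+1)}$: new weights, $w^{(t)}$: old weights, $\gamma^{(t)}$: learning rate
Update the weights based on the gradient of the empirical loss

From Jimmy Lin, Large-Scale Machine Learning at Twitter
https://speakerdeck.com/u/lintool/p/large-scale-machine-learning-at-twitter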
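
One batch update of this rule might look like the following in Python. It is a sketch under assumptions: logistic loss with labels in {-1, +1}, binary features, a constant learning rate, and a tiny made-up dataset.

```python
import math
from collections import defaultdict

def gradient(weights, features, label):
    """Gradient of log(1 + exp(-y * w.x)) for the active features (x_f = 1)."""
    score = sum(weights.get(f, 0.0) for f in features)
    g = -label / (1.0 + math.exp(label * score))
    return {f: g for f in features}

def gd_step(weights, data, lr):
    """One batch update: w <- w - lr * (1/n) * sum of per-example gradients."""
    total = defaultdict(float)
    for label, features in data:
        for f, g in gradient(weights, features, label).items():
            total[f] += g
    new = dict(weights)
    for f, g in total.items():
        new[f] = new.get(f, 0.0) - lr * g / len(data)
    return new

data = [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"]), (1, ["A:3", "B:8"])]
w = {}
for _ in range(100):  # every step reads the whole dataset
    w = gd_step(w, data, lr=0.5)
```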

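Parallel computation of the gradient

$w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla loss(f(x_i; w^{(t)}), y_i)$

mappers: compute the gradients in parallel
single reducer: update the weights

• In practice, the weight update needs to know which features (x_i) were updated
• w is a Map<feature, weight> with Map.size()=54,686,452
• Many iterations are needed, and MapReduce, whose input and output go through the DFS, is a poor fit
• Computation in the reducer becomes the bottleneck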
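
The mapper/reducer split can be imitated in plain Python. This is a sketch, not Hadoop code: `partial_gradient` and `reduce_update` are hypothetical names, and the loss is assumed to be the logistic loss with labels in {-1, +1}.

```python
import math
from collections import Counter

def partial_gradient(weights, split):
    """Mapper: sum of per-example gradients over one data split, plus its row count."""
    partial = Counter()
    for label, features in split:
        score = sum(weights.get(f, 0.0) for f in features)
        g = -label / (1.0 + math.exp(label * score))
        for f in features:
            partial[f] += g
    return partial, len(split)

def reduce_update(weights, mapper_outputs, lr):
    """Single reducer: add up the partial sums, then apply one gradient-descent step."""
    total, n = Counter(), 0
    for partial, count in mapper_outputs:
        total.update(partial)  # Counter.update adds the values per key
        n += count
    new = dict(weights)
    for f, g in total.items():
        new[f] = new.get(f, 0.0) - lr * g / n
    return new

data = [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"]), (1, ["A:3", "B:8"])]
splits = [data[:2], data[2:]]
w1 = reduce_update({}, [partial_gradient({}, s) for s in splits], lr=0.5)
```

Every mapper output passes through the one `reduce_update` call, which is where the bottleneck arises once the weight map has tens of millions of entries.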

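Stochastic gradient descent
• Gradient Descent
$w^{(t+1)} = w^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=0}^{n} \nabla loss(f(x_i; w^{(t)}), y_i)$
Updating the model requires all training instances (batch learning)

• Stochastic Gradient Descent (SGD)
$w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla loss(f(x; w^{(t)}), y)$
The weights are updated on each training instance (online learning)

– Processed with Iterative Parameter Mix, it works surprisingly well in practice and does not need that many iterations
• Split the data and compute $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla loss(f(x; w^{(t)}), y)$ in parallel on each mapper
• Distribute the model parameters at every iteration/epoch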
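
A compact Python sketch of SGD with iterative parameter mixing. The assumptions are: logistic loss with labels in {-1, +1}, a constant learning rate, and mixing by averaging each feature's weight over the models that contain it. The splits stand in for the data seen by each mapper.

```python
import math

def sgd_epoch(weights, split, lr):
    """One SGD pass over a split: the weights change after every training instance."""
    w = dict(weights)
    for label, features in split:
        score = sum(w.get(f, 0.0) for f in features)
        g = -label / (1.0 + math.exp(label * score))
        for f in features:
            w[f] = w.get(f, 0.0) - lr * g
    return w

def mix(models):
    """Average each feature's weight over the models that contain it."""
    sums, counts = {}, {}
    for m in models:
        for f, v in m.items():
            sums[f] = sums.get(f, 0.0) + v
            counts[f] = counts.get(f, 0) + 1
    return {f: sums[f] / counts[f] for f in sums}

def iterative_parameter_mix(splits, epochs, lr):
    """Each epoch: hand the mixed model to every split, run SGD, mix again."""
    w = {}
    for _ in range(epochs):
        w = mix([sgd_epoch(w, s, lr) for s in splits])
    return w

splits = [
    [(1, ["A:1", "B:9"]), (-1, ["A:2", "B:7"])],
    [(1, ["A:3", "B:8"]), (-1, ["A:2", "B:9"])],
]
w = iterative_parameter_mix(splits, epochs=5, lr=0.5)
```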


A typical machine learning data flow
Training data (Label, array<feature>) → train → Model data (Map<feature, weight>)
Test data (array<feature>) + Model data → predict → Label/Prob


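A typical data flow for parallel training
Training data (Label, array<feature>)
→ map (several in parallel), each computing weights with SGD: $w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla loss(f(x; w^{(t)}), y)$
→ each map emits Map<feature,weight>
→ reduce: take the average of the weights
→ Model data (Map<feature, weight>)
When iterating, pass the old model to the maps

Machine learning is an aggregation problem
Intuitively, it can be implemented as a Hive/Pig UDAF (user defined aggregation function)
Really, Dremel, which specializes in parallel aggregation, is a better fit than M/R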
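
To make the aggregation view concrete, here is a Python stand-in that loosely follows the iterate / merge / terminate steps of a Hive UDAF. The class and method names are invented for illustration, and the logistic loss with labels in {-1, +1} is an assumption.

```python
import math

class TrainLogisticAggregator:
    """Hypothetical UDAF-like aggregator whose partial state is a weight map."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.weights = {}  # map-side state, updated by SGD
        self.sums = {}     # reduce-side state: per-feature weight sums
        self.counts = {}   # reduce-side state: partial models seen per feature

    def iterate(self, features, label):
        """Map side: one SGD step on one row."""
        score = sum(self.weights.get(f, 0.0) for f in features)
        g = -label / (1.0 + math.exp(label * score))
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) - self.lr * g

    def terminate_partial(self):
        """Map side: hand the partial model to the reducer."""
        return dict(self.weights)

    def merge(self, partial):
        """Reduce side: fold in one mapper's weight map."""
        for f, v in partial.items():
            self.sums[f] = self.sums.get(f, 0.0) + v
            self.counts[f] = self.counts.get(f, 0) + 1

    def terminate(self):
        """Final result: one map holding the averaged weights."""
        return {f: self.sums[f] / self.counts[f] for f in self.sums}

m1, m2, reducer = (TrainLogisticAggregator() for _ in range(3))
m1.iterate(["A:1"], 1)
m2.iterate(["A:1"], 1)
m2.iterate(["A:2"], -1)
for mapper in (m1, m2):
    reducer.merge(mapper.terminate_partial())
model = reducer.terminate()
```

The result is a single map built on one reducer, so its size is bounded by that reducer's memory.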

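At first I simply built it as a UDAF that returns a map
create table model as
select trainLogisticUDAF(features,label [, params]) as weight from training
On the map side the map fits in memory if the split size is tuned, but at larger scale the reduce side runs out of memory, so this does not scale with the amount of data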

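Think relational
Training data (Label, array<feature>) → train → Model data (Map<feature, weight>)
Test data (array<feature>) + Model data → predict → Label/Prob

Returning the model as a scalar value is no good
Return (feature, weight) as a relation instead
But a UDAF cannot do that
→ Hence a UDTF (User-Defined Table-generating Function)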

UDTF (parameter-mix)
select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from
  ( select
      TrainLogisticSgdUDTF(features,label,..) as (feature,weight)
    from train
  ) t
group by feature;
The inner select is map-only: mappers are launched according to Hadoop's InputSplitSize setting

How can we do iterative parameter mixing???
The old model has to be passed in
Passing it with every row is hardly ideal…

UDTF (iterative parameter mix)
create table model1sgditor2 as
select
  feature,
  CAST(avg(weight) as FLOAT) as weight
from (
  select
    TrainLogisticIterUDTF(t.features, w.wlist, t.label, ..)
      as (feature, weight)
  from
    training t join feature_weight w on (t.rowid = w.rowid)
) t
group by feature;
What each row needs is the part of the old model Map<feature, weight> that covers its own features, plus the label, so build a table holding the Array<weight> that corresponds to each Array<feature> and pass it in with an inner join
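
How the feature_weight table relates to the old model can be sketched in Python. This is an assumption about its construction, which the slides do not show: each row's wlist holds the old weight of every feature in that row, with 0.0 assumed for features the model has not seen.

```python
def build_feature_weight(training, model):
    """For each rowid, list the old weight of every feature in that row."""
    return {
        rowid: [model.get(f, 0.0) for f in features]
        for rowid, (features, _label) in training.items()
    }

# Hypothetical tables: training is rowid -> (features, label), model is feature -> weight.
training = {1: (["A:1", "B:9"], 1), 2: (["A:2", "B:7"], -1)}
model = {"A:1": 0.25, "B:7": -0.1}
feature_weight = build_feature_weight(training, model)

# Inner join on rowid, yielding the arguments the training function receives per row.
joined = [
    (rowid, features, feature_weight[rowid], label)
    for rowid, (features, label) in training.items()
    if rowid in feature_weight
]
```

Each row then carries only the weights it needs, so no mapper has to hold the full model in memory.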

An example flow for the Pig version
-- weak learning (one model per dataid group)
training_raw = load '$TARGET' as (clicks: int, impression: int, displayid: int, adid: int, advertiserid: int,
    depth: int, position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int,
    userid: int, gender: int, age: int);

training_bin = foreach training_raw generate flatten(predictor.ctr.BinSplit(clicks, impression)), displayid, adid,
    advertiserid, depth, position, queryid, keywordid, titleid, descriptionid, userid, gender, age;
training_smp = sample training_bin 0.1;

training_rnd = foreach training_smp generate (int)(RANDOM() * 100) as dataid, TOTUPLE(*) as training;
training_dat = group training_rnd by dataid;

model = foreach training_dat generate predictor.ctr.TrainLinear(training_rnd.training.training_smp);

store model into '$MODEL';

-- ensemble learning (combine the predictions of the loaded models)
model = load '$MODEL' as (mdl: map[]);
model_lmt = limit model 10;

testing_raw = load '$TARGET' as (dataid: int, displayid: int, adid: int, advertiserid: int, depth: int,
    position: int, queryid: int, keywordid: int, titleid: int, descriptionid: int,
    userid: int, gender: int, age: int);

testing_with_model = cross model_lmt, testing_raw;

result = foreach testing_with_model generate dataid, predictor.ctr.Pred(mdl, displayid, adid, advertiserid, depth,
    position, queryid, keywordid, titleid, descriptionid, userid, gender, age) as ctr;

result_grp = group result by dataid;
result_ens = foreach result_grp generate group as dataid, predictor.ctr.Ensemble(result.ctr);
result_ens_ord = order result_ens by dataid;
result_fin = foreach result_ens_ord generate $1;
store result_fin into '$RESULT';
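
The slides do not show the bodies of predictor.ctr.BinSplit and predictor.ctr.Ensemble, so the Python below is a guess at their behaviour. It assumes BinSplit expands an aggregated row into one label per impression (clicks positive, the rest negative) and that Ensemble is a plain average of the per-model CTR predictions.

```python
def bin_split(clicks, impression):
    """Expand one aggregated row into labels: `clicks` positives, the rest negatives."""
    return [1] * clicks + [-1] * (impression - clicks)

def ensemble(ctrs):
    """Combine the per-model CTR predictions for one test row by simple averaging."""
    return sum(ctrs) / len(ctrs)

labels = bin_split(clicks=2, impression=5)   # 2 clicked impressions out of 5
combined = ensemble([0.1, 0.2, 0.3])         # predictions from three weak models
```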

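Summary
• Built something that scales properly with the amount of data
– Had an intern build the Pig version
• That one does not use a UDTF; its strategy is to split the model into separately trained files and combine them by ensemble learning
– Things like online model updates need some ingenuity, because Hive has no update and they must be expressed as inserts
– A Passive-Aggressive version is also planned
• Currently AUC = about 0.75 (the winner, National Taiwan University, reached 0.8)
– On the a9a dataset, accuracy is on par with libsvm, svm-light, liblinear, tinysvm, etc. (around 0.85)
• If time allows, it will be sent to Hive as a patch
– But documentation, tests, and so on xxxxx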
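
For reference, plain AUC can be computed as the fraction of (positive, negative) pairs that the model ranks correctly. The Python sketch below uses that pairwise definition on made-up scores; it is quadratic in the number of examples and is not the competition's evaluation script.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
value = auc([1, -1, 1, -1], [0.9, 0.1, 0.4, 0.6])
```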

Looking for joint-research partners who have real data
(one project is under way with an ad delivery company)
