Notes from the SIG-KBS (Special Interest Group on Knowledge-Based Systems) talk on computer science and statistics at Google Marketing

Google gave a talk at SIG-KBS, and I took notes, so I am posting them here. I have heard about three Google-related talks before, but never one about statistics, so this one was very interesting.

Streaming on Ustream or photographing the slides and uploading them to the web was off limits, but writing it up on a blog is apparently fine, so here it is. These are my raw notes as taken, so there is no extra explanation.

The explanations of the in-house, internally optimized language and of the statistical models used for statistical analysis were interesting.

CS and statistics

Computer Science and Statistics at Google Marketing

Joined Google after graduating from the University of Tokyo

2009 Quantitative Marketing Manager @ Tokyo

PhD in Engineering from the University of Tokyo (focused on computational science)

modeling, computational and simulation analysis of complex networks

Visualization of large-scale complex graphs!

There is only one person

Member

Web search

Statistics

1 Let others speak for you
(2 and 3 are important)
2 Data not hype
3 Results must be trackable
4 Promote trial
5 You're smart and your time matters.
6 We're serious. Except when we're not.
7 Big ideas move us.

Seven principles of Google Marketing

Not every employee in the company has broad access to data

Intranet search engine

Source code access is for software engineers only, but apparently the statistics people can view it too

Restrictions are strict

How should non-engineers look at log data?
Reporting tools

analysis skills
engineering skills
product and market-specific knowledge and expertise
extensive analytical and statistical skills.

analyses to help inform marketing strategies for key products

We have the same data/logs access privileges as software engineers

We are supposed to be data analysis professionals

Two case studies

Display Ad Experiments

Goes up from 100 to 120
- this alone does not tell you anything

Effect

Test & control ads
- measuring the advertising effect of video is difficult

Comparative experiments
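The point about test and control ads can be sketched in a few lines: a number rising from 100 to 120 only means something once it is compared against a control group that did not see the ad. The following is a minimal sketch, not the method shown in the talk. The function names and the impression counts are hypothetical, and the 100 and 120 are reused purely as illustrative conversion counts.

```python
def conversion_rate(conversions: int, impressions: int) -> float:
    """Fraction of impressions that led to a conversion."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return conversions / impressions


def relative_lift(test_rate: float, control_rate: float) -> float:
    """Relative change of the test group against the control group."""
    if control_rate == 0:
        raise ValueError("control rate must be non-zero")
    return (test_rate - control_rate) / control_rate


# Hypothetical numbers: both groups have the same number of impressions.
control = conversion_rate(100, 10_000)  # group shown the control ad
test = conversion_rate(120, 10_000)     # group shown the test ad
print(f"relative lift: {relative_lift(test, control):.1%}")
```

A real experiment would also need a significance test before trusting the lift, which this sketch leaves out.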

Media Mix

Statistical models

how does data analysis work at Google

Data analysis Visualization(R, SQL, visualization SQL)
Python
SQL
Statistical models

Google Technology
MapReduce
Google File System
Bigtable
Visualization API

Programming Languages
- Sawzall
- Python
- JavaScript

Statistical Analysis
- R
- SQL

What they do resembles university research

Quite a few people inside Google develop R libraries.

Data analysis procedure

datapull from various logs
datapull from other data sources
aggregate and process
statistical analysis -- apply statistical models on the data
visualize and publish as presentation or report

Logs(70%)

access log
client download/update log

we don't use an RDBMS at this stage

the data is simply too huge
requires distributed computing with many machines
often no complex data manipulation is needed

Goal of data analysis is often rather simple
- sum histogram max min topN filtering
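The simple goals listed above (sum, histogram, max, min, topN, filtering) can each be expressed in a line or two. This is a small single-machine sketch for illustration only. The record layout, the `summarize` function, and its parameters are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical access-log records: (url, latency in milliseconds).
records = [
    ("/search", 120),
    ("/maps", 340),
    ("/search", 95),
    ("/mail", 210),
    ("/search", 150),
    ("/maps", 280),
]


def summarize(records, bucket_ms=100, top_n=2, min_latency_ms=100):
    """Compute the simple aggregates named in the notes over one batch."""
    latencies = [ms for _, ms in records]
    return {
        "sum": sum(latencies),
        "max": max(latencies),
        "min": min(latencies),
        # histogram: record count per latency bucket, keyed by bucket lower bound
        "histogram": dict(Counter((ms // bucket_ms) * bucket_ms for ms in latencies)),
        # topN: most frequently requested URLs
        "top_n": Counter(url for url, _ in records).most_common(top_n),
        # filtering: keep only records at or above the latency threshold
        "filtered": [r for r in records if r[1] >= min_latency_ms],
    }


print(summarize(records))
```

None of these need a join, which fits the point in the notes that an RDBMS is unnecessary at this stage.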

The analysis methods are often simple

Terabytes get processed in one go

No RDBMS needed
- the hard part is joins
- there is no need to do that kind of thing

Log structure
- request URL

http://code.google.com/p/protocolbuffers

They use a programming language called Sawzall to make MapReduce easy to use.

Sawzall

Query Geo Distribution

proto "querylog.proto"

datum: table sum[t: time][lat: int][lon: int] of int;

log_record: QueryLogProto
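For readers who do not know Sawzall, the idea behind the fragment above is roughly this: each log record is handled on its own and emits a value into a table indexed by time, latitude and longitude, and the table sums whatever it receives. The Python below is a single-machine sketch of that idea, not Sawzall and not Google's implementation. The record fields, the hourly time bucket, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Per-record step: emit a count of 1 keyed by (hour, lat degree, lon degree)."""
    hour_start = (record["t"] // 3600) * 3600  # assumed bucketing: one hour
    yield (hour_start, int(record["lat"]), int(record["lon"])), 1


def aggregate(emitted):
    """Stand-in for a summing table: add up the emitted values per key."""
    table = defaultdict(int)
    for key, value in emitted:
        table[key] += value
    return dict(table)


def run(records):
    """Apply the per-record step to every record, then aggregate."""
    return aggregate(kv for r in records for kv in map_record(r))


# Hypothetical query-log records (t is seconds since some epoch).
logs = [
    {"t": 0, "lat": 35.68, "lon": 139.76},
    {"t": 1800, "lat": 35.01, "lon": 139.10},
    {"t": 4000, "lat": 35.20, "lon": 139.99},
    {"t": 100, "lat": 34.69, "lon": 135.50},
]
print(run(logs))
```

Because the per-record step never looks at any other record, it can be spread over many machines, and only the aggregation has to combine results.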

A language you can learn in a day
It hides the MapReduce processing, so you can use it without much effort.

Heavily optimized for internal use

Ninety-something percent

Fast

Failure-obvious
- discard and re-calculate the record with error rather than stall the whole computation.
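The discard half of this failure handling can be sketched as follows: a record that fails to parse is counted and skipped, and the rest of the computation carries on. This is only an illustration of the idea in Python. The line format and the function names are hypothetical, and the re-calculation step mentioned in the notes is not shown.

```python
def parse_latency(line: str) -> int:
    """Parse a 'url,latency_ms' line; raises ValueError on malformed input."""
    _url, latency = line.split(",")
    return int(latency)


def total_latency(lines):
    """Sum latencies, skipping bad records instead of aborting the whole run."""
    total = 0
    skipped = 0
    for line in lines:
        try:
            total += parse_latency(line)
        except ValueError:
            skipped += 1  # discard this record and keep going
    return total, skipped


# Hypothetical input with two malformed lines.
lines = ["/search,120", "garbage", "/maps,abc", "/mail,80"]
print(total_latency(lines))
```

Returning the skipped count alongside the total makes it possible to check afterwards how much data was dropped.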

MySQL-like database engine

CSV/text files in GFS
Bigtable
local MySQL

Need to parse and aggregate different data sources
- usually write a Python script
- sometimes use a local MySQL database for aggregation.
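Combining different sources with a Python script and a local database might look something like the sketch below. It is an assumption-heavy illustration: `sqlite3` from the Python standard library stands in for the local MySQL mentioned in the notes so that the example runs without a server, and the table names, columns, and data are all made up.

```python
import csv
import io
import sqlite3

# Hypothetical inputs: a CSV export and rows already pulled from logs.
CSV_TEXT = "country,ad_spend\nJP,100\nUS,250\n"
LOG_ROWS = [("JP", 40), ("US", 90), ("JP", 10)]


def join_sources(csv_text, log_rows):
    """Load both sources into a local database and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a local MySQL instance
    conn.execute("CREATE TABLE spend (country TEXT, ad_spend INTEGER)")
    conn.execute("CREATE TABLE visits (country TEXT, n INTEGER)")
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO spend VALUES (?, ?)",
        [(row["country"], int(row["ad_spend"])) for row in reader],
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?)", log_rows)
    rows = conn.execute(
        """
        SELECT s.country, s.ad_spend, SUM(v.n)
        FROM spend AS s JOIN visits AS v ON v.country = s.country
        GROUP BY s.country, s.ad_spend
        ORDER BY s.country
        """
    ).fetchall()
    conn.close()
    return rows


print(join_sources(CSV_TEXT, LOG_ROWS))
```

At this point the data has already been reduced from the raw logs, so a join on a single machine is cheap.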

Building statistical models

Apply appropriate statistical methods for given problems. Some examples:

Time-series (seasonal ARIMA) model
Linear mixed effects (LME)
Random forest models
DhD propensity scoring
Experimental design
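As a small taste of the time-series item in the list above, here is one preprocessing step used in seasonal ARIMA modelling: seasonal differencing, which subtracts the value one season earlier to remove a repeating pattern. This is not a full ARIMA fit and not the team's code. The function name and the quarterly series are hypothetical.

```python
def seasonal_difference(series, period):
    """Return series[i] - series[i - period] for every i >= period."""
    if period <= 0:
        raise ValueError("period must be positive")
    return [series[i] - series[i - period] for i in range(period, len(series))]


# Hypothetical quarterly data with a strong seasonal pattern and a small trend.
quarterly = [10, 20, 30, 40, 12, 22, 32, 42]
print(seasonal_difference(quarterly, period=4))
```

After differencing, the seasonal swing is gone and only the year-over-year change remains, which is what the remaining model terms are then fitted to.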

20+ statisticians and quant analysts on the team.

R is most commonly used

Time-series analysis!!!

Visualization and Presentation

Animation of time-series correlations

Adhering to Engineering standards

Sharing all source code with all other software engineers
check code into a single repository for the whole company
- your code may be used or edited by someone in the future
all code has to follow coding styles
all code has to be reviewed by peers before check-in

Sharing computing resources with all other engineers.
- K distributed machines.
same infrastructure as production.

Even throwaway code gets reviewed

The scope of work is broad

Could YouTube traffic be used for something?

Challenges we are facing

Complex questions without simple solutions

Large volumes of data
- can't achieve w/o sophisticated computing infrastructure
- analysts need to have necessary technical / engineering and quantitative skills

Limited resources (hiring & training)

Privacy

There are hardly any statisticians working on internal analysis.

A CS + statistics + math background is difficult to find.

They get out statistics textbooks and study.

There are not many statistics majors in Japan.

Google has many data analyst teams, including us QM

We are NOT software engineers but are equipped with either engineering or statistics backgrounds and adhere to engineering standards at Google

We undertake complex research and modeling projects that involve large-scale data processing and intensive statistical analysis.

We are hoping

Datacenters
MapReduce
Sawzall
other Google technologies(GFS, Bigtable)

Google-related papers
http://labs.google.com/papers

QM also presents some papers at the JSM (Joint Statistical Meetings) conference
http://www.amstat.org/meetings/jsm/2010/

The approach is to use whatever can be obtained
Raw data is preferred
It is a trade-off with cost