Let's Make the Niconico Douga (ニコニコ動画) Dataset Searchable
@PENGUINANA_
whoami
• @PENGUINANA_ / 兼山元太
• Engineer at *.cookpad.com/*
• Search infrastructure and service development
JSON around us
• tweet
  • 140 character message
  • user_name
  • datetime
  • location
  • reply or not / contains link or not / retweeted count / reply count ...
JSON around us
• access log
  • ip address
  • requested content
  • status code
  • response time
  • referrer
JSON around us
• event log
  • user_id
  • event name
  • params (hash)
  • datetime
  • user agent
JSON around us
• dictionary edit request
  • keyword
  • operation type
  • requester
  • status (applied or not)
kibana
• http://demo.kibana.org/
• http://www.elasticsearch.org/blog/kibana-whats-cooking/
kibana@cookpad
• log dashboard for internal API
• explore log
• capacity planning
• performance check
• slowquery
dashboard for each application
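Theme
• If we could search and analyze JSON data flexibly, regardless of its size, day-to-day work would get easier
• How can we do that? Is it hard?
Just try it
• Niconico Douga dataset
• Make it searchable and analyzable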
Dataset
• Official Niconico Douga dataset
• Metadata for 8 million videos
• 2.5 billion comments
• JSON format (compressed: 60 GB, uncompressed: 300 GB)
http://goo.gl/FYtO5T
Result
• Done in 4 hours with Elasticsearch on AWS
• S3 -> unzip -> Elasticsearch (173k docs/s)
• Cost: 550 yen
Demo
• Date facet over 2.5 billion comments
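A date facet like the one in the demo can be requested with a date_histogram facet, which is how the 0.90.x facets API expressed time histograms (later releases replaced facets with aggregations). The following is a minimal sketch of such a request body, assuming the `date` field from the comment mapping in this deck. The facet name `comments_over_time` and the monthly interval are illustrative choices, not taken from the slides.

```python
import json


def date_facet_request(field="date", interval="month"):
    """Build a search body that returns only a date_histogram facet."""
    return {
        "size": 0,  # skip the hits, return facet counts only
        "query": {"match_all": {}},
        "facets": {
            # "comments_over_time" is a hypothetical facet name
            "comments_over_time": {
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


body = date_facet_request()
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a search request against the comment type, for example to 'http://localhost:9200/nico2/comment/_search' on one of the nodes.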
install
• wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.noarch.rpm
• sudo rpm -i elasticsearch-0.90.3.noarch.rpm
install plugins
• sudo bin/plugin -install elasticsearch/elasticsearch-cloud-aws
• sudo bin/plugin -install mobz/elasticsearch-head
• sudo bin/plugin -install lukas-vlcek/bigdesk
• sudo bin/plugin -install elasticsearch/elasticsearch-analysis-kuromoji
elasticsearch-cloud-aws
• cluster node discovery in AWS
• add config to elasticsearch.yml
cloud:
  aws:
    access_key: AKI...........
    secret_key: mR.............
discovery:
  type: ec2
discovery.ec2.groups: es_test  # security group name
elasticsearch-head
bigdesk
elasticsearch-analysis-kuromoji
• Japanese analyzer
config
• # Set a custom allowed content length:
• http.max_content_length: 1000m
• # Heap Size (defaults to 256m min, 1g max)
• ES_HEAP_SIZE=3g
• # ElasticSearch data directory
• DATA_DIR=/media/ephemeral1/es,/media/ephemeral2/es,/media/ephemeral3/es
make AMI
• elasticsearch machine image
launch ES Instances
• c1.xlarge x 20
  • CPU: Xeon, 8 cores (2,300 MHz)
  • Memory: 7 GB
  • Disk: 420 GB x 4
  • $0.07/hour (spot instance)
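The 550 yen on the result slide can be roughly reproduced from the instance count, the spot price, and the 4-hour run time. The sketch below does that arithmetic. The exchange rate is not stated anywhere in the slides, so `ASSUMED_JPY_PER_USD` is an assumption chosen only for illustration, and storage and transfer costs are ignored.

```python
# Figures taken from the slides
NODES = 20
SPOT_PRICE_USD_PER_HOUR = 0.07
IMPORT_HOURS = 4

# Assumption: the slides give no exchange rate; 98 yen/USD is illustrative
ASSUMED_JPY_PER_USD = 98.0


def cluster_cost_usd(nodes, price_per_hour, hours):
    """Compute cost of running `nodes` instances for `hours` hours."""
    return nodes * price_per_hour * hours


cost_usd = cluster_cost_usd(NODES, SPOT_PRICE_USD_PER_HOUR, IMPORT_HOURS)
cost_jpy = cost_usd * ASSUMED_JPY_PER_USD
print(f"{cost_usd:.2f} USD, about {cost_jpy:.0f} JPY")
```

Under that assumed rate the compute cost comes to about 5.60 USD, which is in line with the quoted 550 yen.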
deploy data
• download from S3 to the nodes
  • use s3cmd (a few minutes with GNU Parallel)
• unzip (60 GB -> 300 GB)
bulk import
{ "index" : { "_id" : "sm14784868 1", "parent": "sm14784868" } }
{"date":"2011-06-18T20:15:30+09:00","no":1,"vpos":63,"comment":"1","command":"184"}
...
{ "index" : { "_id" : "sm14784868 2", "parent": "sm14784868" } }
{"date":"2011-07-24T02:22:58+09:00","no":2,"vpos":4651,"comment":"2 get","command":"184"}
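The bulk format above alternates an action line and a source line, each a complete JSON object on a single line, and the request body ends with a newline. A minimal sketch of generating those lines is shown here. The function name `bulk_lines` and the in-memory `comments` list are hypothetical; how the author actually produced the request files is not shown in the slides.

```python
import json


def bulk_lines(video_id, comments):
    """Yield action and source lines for the bulk API, one JSON object per line."""
    for c in comments:
        # _id follows the "<video_id> <comment no>" pattern shown on the slide
        action = {"index": {"_id": f"{video_id} {c['no']}", "parent": video_id}}
        yield json.dumps(action, ensure_ascii=False)
        yield json.dumps(c, ensure_ascii=False, separators=(",", ":"))


comments = [
    {"date": "2011-06-18T20:15:30+09:00", "no": 1, "vpos": 63,
     "comment": "1", "command": "184"},
    {"date": "2011-07-24T02:22:58+09:00", "no": 2, "vpos": 4651,
     "comment": "2 get", "command": "184"},
]

# The bulk body must be newline-delimited and end with a trailing newline
payload = "\n".join(bulk_lines("sm14784868", comments)) + "\n"
print(payload, end="")
```

Because `json.dumps` escapes any newline inside a comment string, each document stays on one line even when the comment text itself contains line breaks.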
bulk import
• ls request_file* | parallel -j N curl -X POST -s -D - 'http://localhost:9200/nico2/comment/_bulk' -o /dev/null --data-binary @{}
wc -l requests
> 4.8 billion
import... import... import...
• every node can handle indexing requests
• curl bulk import on each node (x20)
• I/O spread across 3 disks
• took 4 hours
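The 173k docs/s figure on the result slide follows from the comment count and the 4-hour import time. The sketch below works through that arithmetic and also derives a per-node rate, assuming the load was spread evenly over the 20 nodes (the slides do not say how even it was).

```python
TOTAL_COMMENTS = 2_500_000_000  # "2.5 billion comments" from the dataset slide
IMPORT_SECONDS = 4 * 3600       # "4 hours"
NODES = 20                      # c1.xlarge x 20

docs_per_second = TOTAL_COMMENTS / IMPORT_SECONDS
# Assumption: indexing load is evenly distributed across nodes
per_node = docs_per_second / NODES

print(f"cluster: {docs_per_second:,.0f} docs/s, per node: {per_node:,.0f} docs/s")
```

This gives roughly 173,600 docs/s for the cluster, or about 8,700 docs/s per node.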
efficiency
"mappings": {
  "video": {
    "properties": {
      "video_id": { "type": "string", "index": "no" },
      "title": { "type": "string", "index": "analyzed" },
      "description": { "type": "string", "index": "analyzed" },
      "thumbnail_url": { "type": "string", "index": "no", "store": "yes" },
      "upload_time": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
      "movie_type": { "type": "string", "index": "not_analyzed" },
      "last_res_body": { "type": "string", "index": "analyzed" },
      "tags": {
        "properties": {
          "tag": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}
efficiency
"mappings": {
  "comment": {
    "_parent": { "type": "video" },
    "properties": {
      "date": { "type": "date", "format": "YYYY-MM-dd'T'HH:mm:ss'+09:00'" },
      "no": { "type": "integer" },
      "vpos": { "type": "integer" },
      "comment": { "type": "string" },
      "command": { "type": "string" },
      "video_id": { "type": "string", "index": "not_analyzed" }
    }
  }
}
efficiency
• curl -X POST 'http://localhost:9200/nico2' -d @mapping.json
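The two mapping fragments on the efficiency slides have to live in one JSON document before they can be posted as mapping.json. The sketch below shows one way to assemble that document, with both types under a single top-level "mappings" key. The slides do not show the complete file, so the exact layout here is an assumption, and any index settings (shard count, analyzers) are left out because they are not given.

```python
import json

DATE_FORMAT = "YYYY-MM-dd'T'HH:mm:ss'+09:00'"

video = {
    "properties": {
        "video_id": {"type": "string", "index": "no"},
        "title": {"type": "string", "index": "analyzed"},
        "description": {"type": "string", "index": "analyzed"},
        "thumbnail_url": {"type": "string", "index": "no", "store": "yes"},
        "upload_time": {"type": "date", "format": DATE_FORMAT},
        "movie_type": {"type": "string", "index": "not_analyzed"},
        "last_res_body": {"type": "string", "index": "analyzed"},
        "tags": {
            "properties": {"tag": {"type": "string", "index": "not_analyzed"}}
        },
    }
}

comment = {
    "_parent": {"type": "video"},  # comments are children of videos
    "properties": {
        "date": {"type": "date", "format": DATE_FORMAT},
        "no": {"type": "integer"},
        "vpos": {"type": "integer"},
        "comment": {"type": "string"},
        "command": {"type": "string"},
        "video_id": {"type": "string", "index": "not_analyzed"},
    },
}

# Assumed layout: both types under one "mappings" object
index_body = {"mappings": {"video": video, "comment": comment}}
mapping_json = json.dumps(index_body, indent=2)
print(mapping_json)
```

Saving the printed output as mapping.json gives a file that parses as valid JSON, which the fragments on the slides do not do on their own.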
shrink
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [ {
    "move": {
      "index": "nico2", "shard": 33,
      "from_node": "nodeA", "to_node": "nodeB"
    }
  } ]
}'
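Shrinking a cluster this way means issuing one "move" command per shard held by a node that is about to be shut down. The sketch below builds a reroute body for several shards at once. The helper `plan_moves`, the shard layout, and the round-robin choice of target are all hypothetical; the slides show only a single hand-written move, and the sketch does not check whether the target already holds a copy of the shard, in which case Elasticsearch would reject the move.

```python
import json
from itertools import cycle


def plan_moves(index, shard_locations, retiring, remaining):
    """Build 'move' commands for shards on retiring nodes.

    Targets are picked round-robin from the remaining nodes.
    """
    targets = cycle(remaining)
    commands = []
    for shard, node in sorted(shard_locations.items()):
        if node in retiring:
            commands.append({
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": node,
                    "to_node": next(targets),
                }
            })
    return {"commands": commands}


# Hypothetical layout: shard number -> node currently holding it
shard_locations = {33: "nodeA", 34: "nodeC", 35: "nodeA"}

body = plan_moves("nico2", shard_locations,
                  retiring={"nodeA"}, remaining=["nodeB", "nodeC"])
print(json.dumps(body, indent=2))
```

The resulting JSON has the same shape as the body posted to `_cluster/reroute` above, with one entry in "commands" for each shard leaving nodeA.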
shrink
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.recovery.concurrent_streams": 3
  }
}'
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "1000mb"
  }
}'
Why Elasticsearch?
• proven, scalable search engine
• highly flexible configuration with sensible defaults
• great API
• growing developer and user base
not covered
• mapping
• query DSL
• search performance
• cluster operation
• healthcheck / cluster statistics
• etc...
questions?
