HBase at LINE

~ How to grow our storage together with service ~

中村俊介, Shunsuke Nakamura
(LINE, twitter, facebook: sunsuk7tp)

NHN Japan Corp.

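Self-introduction
Shunsuke Nakamura (中村俊介)

• 2011.10: joined the former Japan corporation as a new graduate (the current Japan corporation since 2012.1)
• LINE server engineer, storage team
• Master of Science, Shudo Lab, Tokyo Institute of Technology
  • Distributed Processing, Cloud Storage, and NoSQL
  • MyCassandra [CASSANDRA-2995]: A modular NoSQL with Pluggable Storage Engine based on Cassandra
• Hatena: intern / part-time infrastructure work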

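NHN/NAVER
• NAVER Korea: search portal site
  • The Korean headquarters holds a 70% search share
  • Originally an in-house venture at Samsung
  • NAVER = Navigate + er
• NAVER Japan
  • The Japan operation is in its third year
• After the business integration, NAVER is a service name, a group, a religion
  • LINE, Matome (まとめ), image search, NDrive
• NHN = Next Human Network
  • NAVER
  • Hangame
  • livedoor
  • DATAHOTEL (データホテル)
  • NHN ST
  • JLISTING
  • Mediator (メディエータ)
  • 深紅
Korean headquarters: Green Factory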

8.17: 55 million users (25 million in Japan)
AppStore Ranking - Top 1
Japan, Taiwan, Thailand, Hong Kong, Saudi Arabia, Malaysia, Bahrain, Jordan, Qatar,
Singapore, Indonesia, Kazakhstan, Kuwait, Israel, Macau, Ukraine, UAE,
Switzerland, Australia, Turkey, Vietnam, Germany, Russia

LINE Roadmap
2011.6 iPhone first release
2011.8 Android first release
WAP
2011.10 I joined the LINE team
Sticker, VoIP
Bots (News, Auto-translation, Public account, Gurin)
PC (Win/Mac), Multi device, Sticker Shop
LINE Card/Camera/Brush
2012.6 WP first release
Timeline
2012.8 BB first release

LINE platform

Target of LINE Storage
start
1. Performing well (put < 1ms, get < 5ms)
2. A highly scalable, available, and eventually consistent storage system built around NoSQL
3. Geographical distribution
future

Global 56.8%, Japan 43.2%

LINE Storage and NoSQL
1. Performing well
2. A highly scalable, available, and eventually consistent storage system built around NoSQL
3. Geographical distribution

At first,

LINE launched with Redis.

Initial LINE Storage
• Target: 1M DL within 2011
• Client-side sharding with a few Redis nodes
• Redis
• Performance: O(1) lookup
• Availability: Async snapshot + slave nodes (+ backup MySQL)
• Capacity: memory + Virtual Memory
[Diagram: app server, queues, dispatcher, storage node (master, slave), backup]

August 28~29, 2011: Kuwait, Saudi Arabia, Qatar, Bahrain…

Over 1 million in 2011

• Sharded Redis
• Shard coordination: ZKs + Manager Daemons (shard allocation using consistent hashing, health check & failover)

x 3 or 5

x3
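The shard allocation mentioned above relies on consistent hashing. The following is only a minimal sketch of that technique, not the Manager Daemons' actual code: the class name, the shard names, the use of MD5, and the number of virtual nodes are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class ConsistentHashRing:
    """Maps keys to shards; each shard owns several virtual points on a ring."""

    def __init__(self, shards, vnodes=100):
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard) for shard in shards for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # First virtual point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for growing a cluster is that adding a shard only moves keys onto the new shard; every other key keeps its old owner.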

However, in fact...
100M DL within 2012

Billions of Messages/Day...

We ran into many problems every day
in 2011.10...

Redis is
NOT easily scalable
NOT persistent

And,

easily dies

2. A highly scalable, available, and eventually consistent storage system built around NoSQL

Take storage designed for 1M users within 2011 and make it handle
1B users within 2012, without stopping the service

Zuckerberg’s Law of Sharing
(2011. July.7)

Y = C * 2 ^ X (Y: sharing data, X: time, C: constant)
Sharing activity will double each year.

How many LINE messages are there per month?

1 billion x 30 = 30 billion
messages/month
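The arithmetic on the two slides above can be written out as a small sketch. The function names are hypothetical, and applying the doubling law to message volume is an illustration only, not a forecast from the talk.

```python
def monthly_messages(per_day: int, days: int = 30) -> int:
    # 1 billion messages/day x 30 days = 30 billion messages/month
    return per_day * days


def projected(c: float, years: float) -> float:
    # Zuckerberg's Law of Sharing: Y = C * 2 ^ X
    return c * 2 ** years
```

For example, `projected(monthly_messages(1_000_000_000), 1)` doubles the monthly figure after one year.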

Data and Scalability
• constant
  • DATA: async operation
  • SCALE: thousands ~ millions per queue
• linear
  • DATA: users’ contact and group
  • SCALE: millions ~ billions
• exponential
  • DATA: message and message inbox
  • SCALE: tens of billions ~
[Chart: constant, linear, and exponential growth; y-axis from 0 to 10,000,000,000]

Data and Workload
• constant
  • FIFO
  • read & write fast
• linear
  • zipf.
  • read fast [w3~5 : r95]
• exponential
  • latest
  • write fast [w50 : r50]
[Figures: Queue, Zipfian curve, Message timeline]

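Choosing Storage
• constant: Redis
• linear, exponential: several options
• HBase
  • Pros: fits the workload; a NoSQL store on a DFS is easy to operate (DFS specialists++)
  • Cons: SPOF; 99th-percentile random read performance is somewhat low
• Cassandra
  • Pros: fits the workload; No SPOF (No Coordinator, rack/DC-aware replication)
  • Cons: operational cost that comes with weak consistency; complex implementation (especially CAS operations)
• MongoDB
  • Pros: convenient features (auto-sharding/failover, various query) → aimed at analytics, which we do not need
  • Cons: does not fit the workload; uses bandwidth and disk poorly
• MySQL Cluster
  • Pros: familiar (we operate up to nearly several thousand servers per service)
  • Cons: we should use something that is distributed by design and write-scalable from the start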

HBase
• Can store hundreds of TB
• Write-scalable over large data, with efficient random access
• Semi-structured model (< MongoDB, Redis)
  • No advanced RDBMS features (TX, joins)
• Strong consistency per row and column family
• NoSQL constructed on DFS
  • No replica management needed / moving Regions is easy
• Multi-partition allocation per RS
  • Ad hoc load balancing

LINE Storage (2012.3)
iPhone Android WAP x 25 Million

Thrift API / Authentication / Renderer

app. server (nginx) app. server (nginx) app. server (nginx)

Redis Queue Redis Queue Redis Queue x 100 nodes
async operation
failed operation
dispatcher dispatcher dispatcher

Sharded Redis clusters (message, contact, group) x 400 nodes

Contact HBase, Message HBase (on HDFS) x 100 nodes
Backup MySQL x 2 nodes (backup operation)

LINE Storage (2012.7)
phone (iPhone/Android/WP/BlackBerry/WAP), PC (Win/Mac) x 50 Million

Thrift API / Authentication / Renderer

app. server (nginx) app. server (nginx) app. server (nginx)

Redis Queue Redis Queue Redis Queue x 200 nodes
async operation
failed operation
dispatcher dispatcher dispatcher

Sharded Redis clusters (message, contact, group) x 600 nodes

Msg HBase01, Primary HBase, Msg HBase02 (on HDFS01, HDFS02) x 200 nodes
Backup MySQL x 2 nodes (backup operation)

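2012.3 → 2012.7
• Users doubled, infrastructure doubled
  • Still "casual data" by HBase standards
• Message HBase runs as a dual cluster
  • Clusters are switched according to the message TTL (TTL: 2 weeks → 3 weeks)
  • HDFS DNs are also used to run M/R jobs for HBase
• Sharded Redis is still basically the primary (400 → 600 nodes)
  • Messages are also read (get) from HBase
  • For everything else, only the model is backed up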

LINE Data on HBase
• LINE data
  • MODEL: <key> → <model>
  • INDEX: <key> <property in model>
  • User: <userId> → <User obj>, <userId> <phone>
• Each model is represented as one row
  • HBase consistency: strong consistency is guaranteed per row and column family
  • Data with multiple models, such as contacts, uses qualifiers (columns)
• Data that needs range queries is gathered into a single row (e.g. message Inbox)
  • Cons: latency grows linearly with the number of columns → delete, search filter with timestamp
[Figure: User Model keyed by userId (email, phone), and INDEX]
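To make the MODEL / INDEX layout above concrete, here is a small in-memory sketch. It uses plain dictionaries as a stand-in for HBase tables; the table names, function names, and the direction of the index (phone to userId) are assumptions, since the slide does not spell them out.

```python
from collections import defaultdict

# table -> row key -> qualifier -> value (a stand-in for an HBase table)
tables = defaultdict(lambda: defaultdict(dict))


def put_user(user_id, user):
    # MODEL: the whole User model lives in one row keyed by userId.
    tables["user"][user_id].update(user)
    # INDEX (assumed direction): find a userId from a property in the model.
    tables["user_phone_index"][user["phone"]]["userId"] = user_id


def find_user_by_phone(phone):
    row = tables["user_phone_index"].get(phone)
    return tables["user"][row["userId"]] if row else None
```

Because the model and the index live in different rows, they are not updated atomically in this sketch, which is one reason an index may lag behind its model and need rebuilding.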

timestamp, version
• Column level timestamp
  • Indexes are built using the model's timestamp
  • The API execution timestamp is used for async and failure handling
  • Also used as a search filter (Cons: TTL cannot be used)
• Multiple versioning
  • Binding multiple emails (e.g. Google account password history)
  • Data trace for customer support

Primary key for LINE
• Generated from a Long-type key: e.g. userId, messageId
• Simple and random, for single lookups
  • No need to consider locality for range queries
• prefix(key) + key
  • prefix(key): ALPHABETS[key%26] + key%1000
• Implement o.a.h.hbase.util.RegionSplitter.SplitAlgorithm
  • Region splitting by prefix

[Diagram: keys grouped by prefix letter. 2600, 2626, 2652, 2756, 2782, 2808 → a; 2601 → b; 2602 → c; d ... z. HRegion boundaries within a letter: a000, a250, a500, a750, b000]
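The prefix rule above, `ALPHABETS[key%26] + key%1000`, can be sketched as follows. Zero-padding the numeric part to three digits is an assumption inferred from the a000/a250/... boundaries in the diagram, and `region_boundaries` is a hypothetical helper, not the actual `SplitAlgorithm` implementation.

```python
import string

ALPHABETS = string.ascii_lowercase  # 26 letters, indexed by key % 26


def prefix(key: int) -> str:
    # Letter bucket plus key % 1000, zero-padded to three digits (assumption).
    return f"{ALPHABETS[key % 26]}{key % 1000:03d}"


def row_key(key: int) -> str:
    # prefix(key) + key
    return f"{prefix(key)}{key}"


def region_boundaries(per_letter: int = 4) -> list:
    # Evenly spaced boundaries inside each letter: a000, a250, a500, a750, b000, ...
    step = 1000 // per_letter
    return [f"{c}{i * step:03d}" for c in ALPHABETS for i in range(per_letter)]
```

Sequential ids such as 2600, 2601, 2602 land under different letters, so consecutive writes spread across regions instead of hitting one.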

Data stored in HBase
• User, Contact, Group
  • linear scale
  • mutable
• Message, Inbox
  • exponential scale
  • immutable

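Message, Inbox
Priority on performance and availability

• Hybrid configuration with Sharded Redis
  • It is enough if reads and writes succeed on either side (< quorum)
  • Failed queries are queued in the JVM heap and Redis, then retried
  • Immutable & idempotent queries: no consistency or duplication problems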
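The hybrid read/write policy above could look roughly like the sketch below. It is a simplified illustration under stated assumptions: `HybridStore` and its dict-like backends are hypothetical, and an in-process deque stands in for the JVM heap and Redis retry queues.

```python
from collections import deque


class HybridStore:
    """Writes go to every backend; one success is enough, failures are retried."""

    def __init__(self, backends):
        self.backends = backends      # name -> dict-like store (hypothetical)
        self.retry_queue = deque()    # stands in for the JVM heap / Redis queue

    def put(self, key, value):
        ok = 0
        for name, store in self.backends.items():
            try:
                store[key] = value    # immutable data: re-applying is harmless
                ok += 1
            except Exception:
                self.retry_queue.append((name, key, value))
        return ok >= 1                # either side succeeding is acceptable

    def get(self, key):
        for store in self.backends.values():
            try:
                return store[key]
            except Exception:
                continue              # fall through to the other backend
        return None

    def retry_failed(self):
        for _ in range(len(self.retry_queue)):
            name, key, value = self.retry_queue.popleft()
            try:
                self.backends[name][key] = value
            except Exception:
                self.retry_queue.append((name, key, value))
```

Retrying blindly is safe here only because the writes are immutable and idempotent; the same approach would not be safe for mutable data such as contacts.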

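User, Contact
Priority on consistency over performance

• Sharded Redis is still the primary
  • There is no scalability problem
  • The data is mutable, so consistency matters
• Migrating from Redis to HBase (in progress)
  • Only the Model Object is backed up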

Migrating from Redis to HBase
1. Back up the models
  • Sync write to Redis, async write to HBase (Java Future, Redis queuing)
2. Full migration from Sharded Redis using M/R
3. Build index / inverted index from the models (eventual) ← we are here
  • Batch Operation: w/ M/R, model table full-scan using TableMapper
  • Incremental Operation: Diff logging and sequential indexing or Percolator, HBase Coprocessor
4. Switch the access path and turn Redis into a cache
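Step 1 above (sync write to Redis, async write to HBase) can be sketched as follows. The slide names Java Future; this sketch swaps in Python's `concurrent.futures` as an equivalent, and the in-memory dictionaries, `redo_queue`, and function names are placeholders rather than the real clients.

```python
from concurrent.futures import ThreadPoolExecutor

redis_store, hbase_store, redo_queue = {}, {}, []
executor = ThreadPoolExecutor(max_workers=4)


def _backup(key, model):
    try:
        hbase_store[key] = model          # asynchronous backup write
    except Exception:
        redo_queue.append((key, model))   # stands in for Redis queuing
        raise


def save_model(key, model):
    redis_store[key] = model              # synchronous write to the primary
    return executor.submit(_backup, key, model)  # Future for the backup write
```

The caller gets its answer as soon as the primary write finishes; the returned Future only needs to be inspected by code that cares whether the backup has landed.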

Did switching to HBase
make us happy?

In a sense, YES

• Scalability issues are solved
  • At least through the end of this year
• Wide-area distribution → 3rd issue (To be continued...)

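Impressions after operating HBase for 8 months

• HBase is a volcano
  • Small eruptions every day
  • They build up and occasionally cause a big eruption
• Living safely at the foot of the volcano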

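Eruptions
• RSs retired because of intermittent network failures
• DN performance degradation from H/W failures, and delayed detection
• Degraded get performance (get, increment, checkAndPut, checkAndDelete), which drags down overall performance
• Performance degradation caused by (major) compaction
• Data inconsistency
• No SPOF-related problems have occurred yet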

HBase Availability

• There are several SPOFs, or components whose death causes downtime

1. HDFS NameNode

2. HBase RegionServer, DataNode

1. HDFS NameNode (NN)

• HA Framework for HDFS NN (HDFS-1623)
  • Backup NN (0.21)
  • Avatar NN (Facebook)
  • HA NN using Linux HA
  • Active/passive configuration deploying two NN (Cloudera)

HA NN using Linux-HA

• Redundancy with DRBD + heartbeat
  • DRBD: disk mirroring (RAID1-like)
  • heartbeat: network monitoring
  • pacemaker: resource management (registering the failover logic)

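2. RegionServer, DataNode
• HBase itself keeps no replicas
  • Downtime occurs until failover completes
• Because the system is made of multiple components, you have to wait
  for a timeout on each communication link, from failure detection to
  cluster-wide agreement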

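Downtime countermeasures
• HBase itself keeps no replicas, so downtime always occurs when an RS dies
• distributed HLog splitting (>=cdh3u3)
• timeout & retry
  • Most of the downtime is the timeout between HClient and RS
  • Tune timeouts (per retry, per operation)
  • If the RS-ZK timeout is too short, RSs are easily evicted when the network is unstable
• Place regions holding the same key on the same RS → limits the scope of a failure
• LINE's HBase access is basically async
• Cluster replication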
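The "tune timeouts per retry, per operation" item above can be illustrated with a small sketch. The schedule values and the `call_with_retry` helper are hypothetical; they show the shape of the idea rather than any HBase client setting.

```python
# Hypothetical timeout schedules in seconds: one entry per attempt, per operation.
TIMEOUTS = {"get": [0.05, 0.1, 0.2], "put": [0.02, 0.05]}


def call_with_retry(op, fn):
    """Calls fn(timeout) with each scheduled timeout until one attempt succeeds."""
    last_error = None
    for timeout in TIMEOUTS[op]:
        try:
            return fn(timeout)
        except TimeoutError as error:
            last_error = error  # try again with the next, longer timeout
    raise last_error
```

Starting with a short timeout keeps the common case fast, while later attempts give a struggling server more room before the caller gives up.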

HBase cluster replication
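• Cluster Replication: master-push model
  • (a binary logging mechanism like MySQL's); see section 8.8 of the "horse book" (HBase: The Definitive Guide)
  • The WAL (HLog) is pushed asynchronously to the slave cluster
  • Synchronous call from each RS
  • ZK manages the list of WALs that have not been synced yet
• We are evaluating it, but it is not designed for replication across multiple DCs
  • We are also considering our own implementation or other approaches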

HDFS tuning for HBase

• Short-circuit local reads: a local client reads the DataNode's
  files directly > 0.23.1, 0.22.1, 1.0.0 (HDFS-2246)
• Rack-aware Replica Placement (HADOOP-692)

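Deletion problems
• Deletes are somewhat slow
  • They are logical deletes, so not as slow as get, but they take twice as long as put
• e.g. account removal for a user with 10,000 contacts
  • Too many columns cause a client-side timeout → queuing + iterative delete
• e.g. deleting messages whose TTL has expired
  • Random I/O on cold data occurs and affects the service
  • → dual cluster, full-truncate or use TTL
• e.g. dealing with spammers
  • get performance suffers until compaction runs (many deleted entries must be skipped)
  • → delete per row rather than per column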
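The "queuing + iterative delete" workaround above can be sketched like this. A plain dictionary stands in for one wide row, and the helper names and batch size are assumptions for illustration.

```python
from collections import deque


def enqueue_deletes(columns, batch_size, queue):
    # Split one huge delete into small batches so no single call times out.
    for i in range(0, len(columns), batch_size):
        queue.append(columns[i:i + batch_size])


def drain(row, queue, max_batches=None):
    # A worker applies batches one at a time; returns how many it processed.
    done = 0
    while queue and (max_batches is None or done < max_batches):
        for column in queue.popleft():
            row.pop(column, None)  # idempotent: deleting twice is harmless
        done += 1
    return done
```

Limiting `max_batches` per run lets the worker yield between batches, so a 10,000-contact account is cleaned up gradually instead of in one long request.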

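Compaction countermeasures

• Bigtable: periodic compaction is required for I/O optimization and for deletes
• Compactions are queued per RS and run one HRegion at a time
• CPU utilization rises while compaction runs, so timing matters
  • Triggers: periodic, number of StoreFiles, user-initiated
• To keep compactions from running back to back at peak time,
  run compaction and region splitting off-peak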
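A scheduler that defers compaction to off-peak hours, one region at a time, could be sketched as below. The window boundaries and function names are hypothetical; the real off-peak hours depend on the service's traffic pattern.

```python
from datetime import time

# Hypothetical off-peak window; a window may also wrap past midnight.
OFF_PEAK_START, OFF_PEAK_END = time(3, 0), time(6, 0)


def is_off_peak(now, start=OFF_PEAK_START, end=OFF_PEAK_END):
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def next_region_to_compact(pending, now):
    # One region at a time, and only off-peak; otherwise leave the queue alone.
    return pending.pop(0) if pending and is_off_peak(now) else None
```

Calling `next_region_to_compact` from a periodic job keeps the queue untouched during peak hours and drains it gradually overnight.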

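Balancing, Splitting, and Compaction
• Region balancing
  • The automatic balancer (request-count-based balancing) is OFF
  • Balance during the service's off-peak hours
  • The same key in different tables is assigned to the same server → limits the scope of a failure
  • A dedicated server for problematic Regions: prison RS
• Scheduling Region splitting and compaction
  • Avoid automatic splits where possible (hbase.hregion.max.filesize triggers automatic split)
  • Avoid consecutive major compactions
  • Turn periodic compaction OFF for immutable storage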

HBase Tools
• Client:
  • HBaseTemplate: HBase wrapper like Spring’s RedisTemplate
  • MirroringHTable: support for multiple HBase clusters
• Operations and monitoring:
  • auto splitting: region split during off-peak hours
  • auto HRegion balancer: off-peak balancing based on metrics
  • Region snapshot & restore: daily dump of the META table, restore when an RS dies
• Data Migration:
  • Migrator with M/R (Redis → HBase)
  • H2H copy tool with M/R: table copy (HBase → HBase)
  • metrics collecting via JMX
  • Index Builder and Inconsistency Fixer with M/R, incremental implementation (coprocessor)

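Future work
• Build indexes and a Redis cache around the <key, model> data on HBase
• Power outage and earthquake countermeasures (rack/DC-awareness)
  • HBase cluster replication
  • Use Cassandra as geographically distributed storage for HLog
• More scalability than today (hundreds of millions to billions of users)
  • HBase is network-bound; the limit is a little under several hundred servers per cluster
  • Either get by with multi-cluster or use Cassandra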
