A Clever Way to Scale-out a Web Application

Cybozu Labs, Inc.
Kazuho Oku

RDB sharding

- denormalization is inevitable

[Figure: three shards (uid:1-2000, uid:2001-4000, uid:4001-6000), each with its own tweet, following, followed_by, ... and timeline tables]

when uid:123 tweets, write his tweet, read uids of his followers, and
update the timeline table of his followers
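
The write path described above can be sketched in a few lines. This is only an illustration of the application-level fan-out that the rest of the talk tries to avoid: the in-memory `shards` dictionary and the helpers `shard_for`, `follow` and `post_tweet` are hypothetical names, not part of any tool mentioned in these slides.

```python
# Hypothetical in-memory model of the three range shards in the figure.
SHARD_STARTS = [1, 2001, 4001]  # first uid held by each shard


def new_shard():
    return {"tweet": [], "followed_by": {}, "timeline": {}}


shards = {start: new_shard() for start in SHARD_STARTS}


def shard_for(uid):
    """Return the shard whose uid range contains uid."""
    start = max(s for s in SHARD_STARTS if s <= uid)
    return shards[start]


def follow(follower_id, target_id):
    # followed_by rows live on the shard of the user being followed
    rows = shard_for(target_id)["followed_by"]
    rows.setdefault(target_id, set()).add(follower_id)


def post_tweet(uid, tweet_id, body):
    home = shard_for(uid)
    # 1. write the tweet to the author's shard
    home["tweet"].append((uid, tweet_id, body))
    # 2. read the uids of the author's followers
    followers = home["followed_by"].get(uid, set())
    # 3. update each follower's timeline, possibly on another shard
    for follower in sorted(followers):
        timeline = shard_for(follower)["timeline"]
        timeline.setdefault(follower, []).append((uid, tweet_id))
    return len(followers)


follow(2431, 123)
follow(45, 123)
written = post_tweet(123, 1, "hello")
```

Every application code path that touches a tweet has to repeat steps 2 and 3 correctly, which is the maintenance burden the following slides address.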

Sep 11 2009 A Clever Way to Scale-out a Web Application 2

Two methods to update the shards

- eventual consistency
  - asynchronous updates using worker processes
  - pros: fast response, high scalability
  - cons: hard to maintain
- 2-phase commit
  - synchronous updates
  - pros: synchronous, doesn't require an external daemon
  - cons: slow response
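
For readers unfamiliar with the second method, here is a minimal sketch of a two-phase commit coordinator. It assumes an in-memory `Participant` class standing in for a DB node; both the class and `two_phase_commit` are hypothetical names for illustration, and a real deployment would use the database's own distributed-transaction support.

```python
class Participant:
    """Hypothetical stand-in for one DB node taking part in the commit."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.committed = []  # rows that were durably written
        self.pending = None  # row staged by prepare()

    def prepare(self, row):
        if not self.can_prepare:
            return False
        self.pending = row
        return True

    def commit(self):
        self.committed.append(self.pending)
        self.pending = None

    def rollback(self):
        self.pending = None


def two_phase_commit(participants, row):
    """Phase 1: ask every node to prepare. Phase 2: commit only if all agreed."""
    prepared = []
    for node in participants:
        if node.prepare(row):
            prepared.append(node)
        else:
            # one refusal aborts the whole update on every node
            for other in prepared:
                other.rollback()
            return False
    for node in participants:
        node.commit()
    return True
```

The caller waits for every node in both phases, which is where the "slow response" drawback comes from.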

The problems

- complex queries
  - reading from / writing to multiple DB nodes
  - cannot use secondary indexes
  - need to maintain per-user views (denormalized tables)
- maintaining consistency between the nodes
  - when using the eventual consistency model
- dynamic scaling
  - adding new nodes without stopping the service


Incline


Incline

- solution for the two problems of eventual consistency:
  - complex update queries
  - maintenance of the denormalized tables
- basic idea
  - do not let app. developers write denormalization logic
  - handle denormalization below the SQL layer
  - by using triggers and queue tables


Incline – illustrated

- insert / update / delete rows of related tables automatically

[Figure: three shards (uid:1-2000, uid:2001-4000, uid:4001-6000), each with its own tweet, following, followed_by, ..., timeline and queue tables]

when uid:123 tweets, write only to his tweet table. Incline updates
other tables automatically
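
The routing decision made on each write can be sketched as follows. This is a simplified Python rendering of what an AFTER INSERT trigger on the tweet table might do on the node holding uid 1-2000; the function name `on_tweet_insert` and the list-based tables are assumptions made for illustration only.

```python
LOCAL_RANGE = (1, 2001)  # this node holds uid 1..2000


def on_tweet_insert(tweet, followed_by, timeline, queue):
    """Fan a new tweet out to followers: local rows directly, others via the queue.

    tweet:       dict with user_id, tweet_id and ctime
    followed_by: list of (user_id, follower_id) rows stored on this node
    timeline:    local timeline table (list of rows)
    queue:       local queue table (list of rows plus an action flag)
    """
    lo, hi = LOCAL_RANGE
    for user_id, follower_id in followed_by:
        if user_id != tweet["user_id"]:
            continue
        row = (follower_id, tweet["ctime"], tweet["tweet_id"], tweet["user_id"])
        if lo <= follower_id < hi:
            timeline.append(row)        # follower lives here: synchronous update
        else:
            queue.append(row + ("I",))  # follower lives elsewhere: queue an insert


followed_by = [(123, 45), (123, 2431), (7, 99)]
timeline, queue = [], []
on_tweet_insert({"user_id": 123, "tweet_id": 1, "ctime": 1000},
                followed_by, timeline, queue)
```

Follower 45 is on the same node and gets a timeline row at once; follower 2431 belongs to another shard, so only a queue row is written locally.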


Incline – illustrated (cont'd)

- insert / update / delete rows of related tables automatically

[Figure: three shards (uid:1-2000, uid:2001-4000, uid:4001-6000), each with its own tweet, following, followed_by, ..., timeline and queue tables]

when uid:2431 starts following uid:940, write only to his following table


Incline – details

- triggers generated from def. files
- sync. updates within each node
- async. updates between the nodes
  - each DB node has a queue table
  - helper program (C++) applies the queued events to other nodes
  - uses a fault-tolerant algorithm
- application only needs to write to the user's shard
Incline – the commands
# create queue tables
% incline --mode=shard --rdbms=mysql --database=microblog \
    --host=10.0.200.10 --source=microblog.json --shard-source=shard.json \
    create-queue

# create triggers
% incline [same options as above] create-trigger

# run forwarder (transfers data from specified host to other shards)
% incline [same options as above] forward


Incline – the definition files
# view def. file (microblog.json)
[
  {
    "source" : [ "tweet", "followed_by" ],
    "destination" : "timeline",
    "pk_columns" : {
      "followed_by.follower_id" : "user_id",
      "tweet.user_id" : "tweet_user_id",
      "tweet.tweet_id" : "tweet_id"
    },
    "npk_columns" : {
      "tweet.ctime" : "ctime"
    },
    "merge" : {
      "tweet.user_id" : "followed_by.user_id"
    },
    "shard-key" : "user_id"
  }, {
    "source" : "following",
    "destination" : "followed_by",
    "pk_columns" : {
      "following.following_id" : "user_id",
      "following.user_id" : "follower_id"
    },
    "shard-key" : "user_id"
  }
]

# shard def. file (shard.json)
{
  "algorithm" : "range-int",
  "map" : {
    "1" : {
      "host" : "10.0.200.10",
      "username" : "pac1251781019"
    },
    "2001" : {
      "host" : "10.0.200.11",
      "username" : "pac1251781332"
    },
    "4001" : {
      "host" : "10.0.200.12",
      "username" : "pac1251781408"
    }
  }
}
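
To show how a "range-int" map can be read, the sketch below resolves a user id to a shard entry, treating each key of the map as the first id of a range. That interpretation is inferred from the uid ranges used throughout these slides; `build_lookup` is a hypothetical helper, not part of Incline.

```python
import bisect

SHARD_MAP = {
    "1": {"host": "10.0.200.10", "username": "pac1251781019"},
    "2001": {"host": "10.0.200.11", "username": "pac1251781332"},
    "4001": {"host": "10.0.200.12", "username": "pac1251781408"},
}


def build_lookup(shard_map):
    """Return a function mapping an integer key to its shard entry."""
    starts = sorted(int(k) for k in shard_map)

    def lookup(key):
        # index of the last range start that is <= key
        i = bisect.bisect_right(starts, key) - 1
        if i < 0:
            raise KeyError(f"no shard covers key {key}")
        return shard_map[str(starts[i])]

    return lookup


lookup = build_lookup(SHARD_MAP)
```

Here `lookup(123)` resolves to the entry for 10.0.200.10 and `lookup(2001)` to the one for 10.0.200.11. The map has no upper bound, so any id of 4001 or more resolves to the last shard.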


Incline – FYI the generated triggers
CREATE TRIGGER _INCLINE_followed_by_INSERT AFTER INSERT ON followed_by FOR EACH ROW BEGIN
  IF (((1<=NEW.follower_id AND NEW.follower_id<2001))) THEN
    INSERT INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  ELSE
    INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id,'I'
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  END IF;
END
CREATE TRIGGER _INCLINE_followed_by_UPDATE AFTER UPDATE ON followed_by FOR EACH ROW BEGIN
  IF (((1<=NEW.follower_id AND NEW.follower_id<2001))) THEN
    REPLACE INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  ELSE
    INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
      SELECT NEW.follower_id,tweet.ctime,tweet.tweet_id,tweet.user_id,'U'
      FROM tweet WHERE tweet.user_id=NEW.user_id;
  END IF;
END
CREATE TRIGGER _INCLINE_followed_by_DELETE AFTER DELETE ON followed_by FOR EACH ROW BEGIN
  IF (((1<=OLD.follower_id AND OLD.follower_id<2001))) THEN
    DELETE FROM timeline
      WHERE timeline.user_id=OLD.follower_id AND tweet_user_id=OLD.user_id;
  ELSE
    INSERT INTO _iq_timeline (user_id,tweet_id,tweet_user_id,_iq_action)
      SELECT OLD.follower_id,tweet.tweet_id,tweet.user_id,'D'
      FROM tweet WHERE tweet.user_id=OLD.user_id;
  END IF;
END
CREATE TRIGGER _INCLINE_following_INSERT AFTER INSERT ON following FOR EACH ROW BEGIN
  IF (((1<=NEW.following_id AND NEW.following_id<2001))) THEN
    INSERT INTO followed_by (user_id,follower_id)
      SELECT NEW.following_id,NEW.user_id;
  ELSE
    INSERT INTO _iq_followed_by (user_id,follower_id,_iq_action)
      SELECT NEW.following_id,NEW.user_id,'I';
  END IF;
END
CREATE TRIGGER _INCLINE_following_DELETE AFTER DELETE ON following FOR EACH ROW BEGIN
  IF (((1<=OLD.following_id AND OLD.following_id<2001))) THEN
    DELETE FROM followed_by
      WHERE followed_by.user_id=OLD.following_id AND followed_by.follower_id=OLD.user_id;
  ELSE
    INSERT INTO _iq_followed_by (user_id,follower_id,_iq_action)
      SELECT OLD.following_id,OLD.user_id,'D';
  END IF;
END
CREATE TRIGGER _INCLINE_tweet_INSERT AFTER INSERT ON tweet FOR EACH ROW BEGIN
  INSERT INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id
    FROM followed_by
    WHERE ((1<=followed_by.follower_id AND followed_by.follower_id<2001))
      AND NEW.user_id=followed_by.user_id;
  INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id,'I'
    FROM followed_by
    WHERE NOT (((1<=followed_by.follower_id AND followed_by.follower_id<2001)))
      AND NEW.user_id=followed_by.user_id;
END
CREATE TRIGGER _INCLINE_tweet_UPDATE AFTER UPDATE ON tweet FOR EACH ROW BEGIN
  REPLACE INTO timeline (user_id,ctime,tweet_id,tweet_user_id)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id
    FROM followed_by
    WHERE ((1<=followed_by.follower_id AND followed_by.follower_id<2001))
      AND NEW.user_id=followed_by.user_id;
  INSERT INTO _iq_timeline (user_id,ctime,tweet_id,tweet_user_id,_iq_action)
    SELECT followed_by.follower_id,NEW.ctime,NEW.tweet_id,NEW.user_id,'U'
    FROM followed_by
    WHERE NOT (((1<=followed_by.follower_id AND followed_by.follower_id<2001)))
      AND NEW.user_id=followed_by.user_id;
END
CREATE TRIGGER _INCLINE_tweet_DELETE AFTER DELETE ON tweet FOR EACH ROW BEGIN
  DELETE FROM timeline
    WHERE timeline.tweet_id=OLD.tweet_id AND timeline.tweet_user_id=OLD.user_id;
  INSERT INTO _iq_timeline (tweet_id,tweet_user_id,user_id,_iq_action)
    SELECT OLD.tweet_id,OLD.user_id,followed_by.follower_id,'D'
    FROM followed_by
    WHERE OLD.user_id=followed_by.user_id
      AND NOT (((1<=followed_by.follower_id AND followed_by.follower_id<2001)));
END


Pacific


Range-based sharding vs. hash-based

- Range-based sharding is better
  - range queries are sometimes necessary
  - manual tuning is easy
  - the number of nodes can be increased gradually
    - with hash-based sharding, you have to add 1,2,4,8,16,32,64,... servers at once
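
A small calculation illustrates the last point. It assumes the simplest hash placement, `key % number_of_nodes`, which is my assumption and not something the slides specify, and compares how many of 6,000 keys change owner when one node is added against a single range split.

```python
KEYS = range(1, 6001)


def moved_by_hash(keys, old_n, new_n):
    """Count keys whose owner changes when modulo placement goes old_n -> new_n."""
    return sum(1 for k in keys if k % old_n != k % new_n)


def range_owner(key, starts):
    """Owner of key is the largest range start that is <= key."""
    return max(s for s in starts if s <= key)


def moved_by_range_split(keys, old_starts, new_starts):
    return sum(
        1 for k in keys
        if range_owner(k, old_starts) != range_owner(k, new_starts)
    )


hash_moved = moved_by_hash(KEYS, 3, 4)      # add one node to three
hash_doubled = moved_by_hash(KEYS, 3, 6)    # double the node count instead
range_moved = moved_by_range_split(
    KEYS, [1, 2001, 4001], [1, 2001, 3001, 4001]
)
```

Under these assumptions, adding a fourth hash node relocates 4,500 of the 6,000 keys, doubling to six nodes relocates 3,000 (each old node hands half of its keys to one new node), and splitting one range relocates only the 1,000 keys of the new range.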


Pacific

- utility programs for dynamic scaling
  - mysqld_jumpstart
  - pacific_divide


mysqld_jumpstart – summary

- create a mysqld instance with a single command
  - service automatically started by daemontools
- setup of primary nodes and slaves
- auto-generated backup script: install_dir/etc/backup.sh
  - uses XtraBackup for hot-backup


mysqld_jumpstart – the commands
# create and start a master database
% mysqld_jumpstart --mysql-install-db=/usr/local/mysql/bin/mysql_install_db \
    --mysqld=/usr/local/mysql/libexec/mysqld --base-dir=/var/servicedb \
    --server-id=1252619462 --socket=/tmp/mysql-servicedb.sock \
    --service-dir=/service/mysql-servicedb \
    --replication-network='10.0.200.0/255.255.255.0'

# backup
% /var/servicedb/etc/backup.sh /var/backup/servicedb.backup.20090911

# create and start a slave database
% mysqld_jumpstart --mysql-install-db=/usr/local/mysql/bin/mysql_install_db \
    --mysqld=/usr/local/mysql/libexec/mysqld --base-dir=/var/servicedb \
    --server-id=1252619493 --socket=/tmp/mysql-servicedb.sock \
    --service-dir=/service/mysql-servicedb \
    --replication-network='10.0.200.0/255.255.255.0' \
    --master-host=10.0.200.1 --from-innobackupex


Splitting a MySQL shard

- use replication to prepare, then upgrade a slave to master

Before:
  1 - 2,000
  2,001 - 4,000   (replicated to a slave)
  4,001 - 6,000

After:
  1 - 2,000
  2,001 - 3,000
  3,001 - 4,000
  4,001 - 6,000


Problems in splitting a shard

- speed vs. safety
  - downtime should be minimal
  - guarantee that all the application servers write to the new node
  - reads may switch to the new node eventually


pacific_divide – the blurbs

- fail-safe
  - application servers using the old sharding definition cannot access the split nodes
  - app. servers reload the definition in such a case
- minimum impact on users
  - no read-locks during division
    - in eventual-consistency mode
  - acquires write lock only against the dividing node
  - write lock time < 10 seconds
    - if no delay in replication
pacific_divide – the split algorithm

1. create a new slave node
2. drop write privileges of the existing username on the dividing node
3. wait until the new node is in sync
4. update incline triggers
5. create a new user and give it read / write privileges
6. update shard def.
7. drop read privileges granted to the old username
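
Steps 2, 5, 6 and 7 are what make the split fail-safe: an application server still holding the old definition also holds the old username, so its writes are rejected until it reloads. The sketch below models only that privilege swap with in-memory objects; `Node`, `AppServer` and `divide` are hypothetical names, and replication and trigger updates (steps 1, 3 and 4) are left out.

```python
class Node:
    """Hypothetical DB node that checks per-username privileges."""

    def __init__(self):
        self.grants = {}  # username -> set of "read" / "write"
        self.rows = []

    def write(self, username, row):
        if "write" not in self.grants.get(username, set()):
            raise PermissionError(username)
        self.rows.append(row)


OLD_USER, NEW_USER = "pac1251781332", "pac1252624011"
old_node, new_node = Node(), Node()
old_node.grants[OLD_USER] = {"read", "write"}
shard_def = {"2001": {"node": old_node, "username": OLD_USER}}


def divide():
    old_node.grants[OLD_USER].discard("write")                   # step 2
    for node in (old_node, new_node):                            # step 5
        node.grants[NEW_USER] = {"read", "write"}
    shard_def["2001"] = {"node": old_node, "username": NEW_USER}  # step 6
    shard_def["3001"] = {"node": new_node, "username": NEW_USER}
    old_node.grants[OLD_USER].discard("read")                    # step 7


class AppServer:
    def __init__(self):
        self.reload()

    def reload(self):
        self.cached = dict(shard_def)  # snapshot of the current definition

    def write(self, uid, row):
        start = max(int(s) for s in self.cached if int(s) <= uid)
        entry = self.cached[str(start)]
        entry["node"].write(entry["username"], row)


app = AppServer()
divide()
try:
    app.write(3500, "tweet")       # stale definition, old username
    stale_write_rejected = False
except PermissionError:
    stale_write_rejected = True
    app.reload()                   # pick up the new definition and retry
    app.write(3500, "tweet")
```

The stale write never reaches the old node; after the reload it lands on the new node that now owns uid 3500.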


pacific_divide – the command
# upgrade 10.0.200.18 to a master with range uid:3,000-
#
# when instructed by pacific_divide, transmit shard.json to all
# application servers and mysql shards (or you may use nfs, etc.)

% pacific_divide --shard-def=shard.json --database=microblog \
    --new-host=10.0.200.18 --from-id=3000 --incline-source=microblog.json

Before:
  1 - 2,000
  2,001 - 4,000   (replicated to a slave)
  4,001 - 6,000

After:
  1 - 2,000
  2,001 - 3,000
  3,001 - 4,000
  4,001 - 6,000


Pacific_divide – how the shard def. changes
# before
{
  "map" : {
    "1" : {
      "host" : "10.0.200.10",
      "username" : "pac1251781019"
    },
    "2001" : {
      "host" : "10.0.200.11",
      "username" : "pac1251781332"
    },
    "4001" : {
      "host" : "10.0.200.12",
      "username" : "pac1251781408"
    }
  }
}

# after
{
  "map" : {
    "1" : {
      "host" : "10.0.200.10",
      "username" : "pac1251781019"
    },
    "2001" : {
      "host" : "10.0.200.11",
      "username" : "pac1252624011"
    },
    "3001" : {
      "host" : "10.0.200.18",
      "username" : "pac1252624011"
    },
    "4001" : {
      "host" : "10.0.200.12",
      "username" : "pac1251781408"
    }
  }
}
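
The change between the two definitions can be expressed as a small transformation: the divided range gets a new username, and a new entry is added for the upper half. The function below is a sketch of that transformation only, assuming the definition is loaded as a plain dictionary; `split_shard` is a hypothetical name and not the pacific_divide implementation.

```python
import copy

BEFORE = {
    "map": {
        "1": {"host": "10.0.200.10", "username": "pac1251781019"},
        "2001": {"host": "10.0.200.11", "username": "pac1251781332"},
        "4001": {"host": "10.0.200.12", "username": "pac1251781408"},
    }
}


def split_shard(shard_def, split_at, new_host, new_username):
    """Return a new definition in which the range containing split_at is cut in two."""
    starts = sorted(int(k) for k in shard_def["map"])
    owner = max(s for s in starts if s <= split_at)
    if owner == split_at:
        raise ValueError("split point is already a range boundary")
    result = copy.deepcopy(shard_def)
    # both halves switch to the new username so stale clients lose access
    result["map"][str(owner)]["username"] = new_username
    result["map"][str(split_at)] = {"host": new_host, "username": new_username}
    return result


after = split_shard(BEFORE, 3001, "10.0.200.18", "pac1252624011")
```

The original dictionary is left untouched, so the old and new definitions can be compared or rolled back.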


DBIx::ShardManager


DBIx::ShardManager – the code
# create manager object
my $mgr = DBIx::ShardManager->new(
    definition => DBIx::ShardManager::Definition::JSON->new(
        file        => 'etc/user_shard_def.json',
        auto_reload => 1,
    ),
    connector => DBIx::ShardManager::Connector::DBI->new(
        driver => 'mysql',
        dbname => 'microblog',
        attr   => {
            mysql_enable_utf8 => 1,
            RaiseError        => 1,
        },
    ),
);


DBIx::ShardManager – the code (cont'd)
# read user's timeline

# first, read my timeline table
my $timeline = $mgr->rw_handle($user_id)->selectall_arrayref(
    'SELECT * FROM timeline WHERE user_id=? ORDER BY ctime DESC LIMIT 20',
    { Slice => {} },
    $user_id,
);
# fetch the tweets using (tweet_user_id,tweet_id) from other shards
$mgr->shard_inner_join(
    $timeline,
    tweet_user_id => {
        'tweet.tweet_id' => 'tweet_id',
    },
);
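
Conceptually, a cross-shard inner join groups the timeline rows by the shard that owns each tweet, fetches the tweets shard by shard, and merges the results. The Python sketch below illustrates that idea with in-memory data; it is not the DBIx::ShardManager implementation, and `shard_of`, `fetch_tweets` and the `TWEETS` table are hypothetical stand-ins.

```python
from collections import defaultdict


def shard_inner_join(rows, shard_of, fetch_tweets):
    """Join timeline rows with tweets stored on other shards.

    rows:         list of dicts containing tweet_user_id and tweet_id
    shard_of:     function mapping a user id to a shard key
    fetch_tweets: function (shard, pairs) -> {(user_id, tweet_id): tweet dict}
    """
    by_shard = defaultdict(list)
    for row in rows:
        key = (row["tweet_user_id"], row["tweet_id"])
        by_shard[shard_of(row["tweet_user_id"])].append(key)
    found = {}
    for shard, pairs in by_shard.items():  # one round trip per shard
        found.update(fetch_tweets(shard, pairs))
    joined = []
    for row in rows:                       # keep the timeline order
        tweet = found.get((row["tweet_user_id"], row["tweet_id"]))
        if tweet is not None:              # inner join: drop unmatched rows
            joined.append({**row, **tweet})
    return joined


TWEETS = {
    1: {(123, 1): {"body": "hello"}},
    2001: {(2431, 7): {"body": "hi"}},
}


def shard_of(user_id):
    return 1 if user_id < 2001 else 2001


def fetch_tweets(shard, pairs):
    return {p: TWEETS[shard][p] for p in pairs if p in TWEETS[shard]}


timeline = [
    {"user_id": 45, "tweet_user_id": 2431, "tweet_id": 7},
    {"user_id": 45, "tweet_user_id": 123, "tweet_id": 1},
    {"user_id": 45, "tweet_user_id": 123, "tweet_id": 99},  # tweet no longer exists
]
joined = shard_inner_join(timeline, shard_of, fetch_tweets)
```

Grouping by shard keeps the number of queries proportional to the number of shards touched rather than to the number of timeline rows.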


DBIx::ShardManager – blurbs

- access to raw DBI handles
  - easy to layer an ORM on top of DBIx::ShardManager
- detects changes and reloads shard def.
  - but may throw exceptions on writes during node divisions by pacific_divide
  - display a maintenance error, and let the user retry
- shard_join to be optimized
  - with Net::Drizzle, or mycached

Conclusion


Conclusion

- RDB sharding is not difficult when using Incline, Pacific, DBIx::ShardManager
  - IMO it is as easy as writing code for a standalone database system
- app. developers can use 2-phase commit if necessary
  - or rely on Incline for async. updates


Current Status & ToDo

- Incline - early beta
  - ToDo: add support for multiple shard keys, add recovery support on data-loss
- Pacific - early beta
  - ToDo: make it a distribution
- DBIx::ShardManager - still alpha
  - ToDo: write more join functions, concurrent access, etc.


Miscellaneous

- Mycached
  - currently in alpha status
  - access MySQL tables using the memcached protocol
  - higher concurrency (thousands of connections)
  - higher throughput (2x compared to SQL)

For more information

- see my blog
  - http://developer.cybozu.co.jp/kazuho/
- DBIx::ShardManager is in coderepos.org/share/lang/perl
- come to BPStudy #25 on 9/25
  - 2h30m talk on Incline, Pacific, DBIx::ShardManager (hopefully including demos)

