Architecture by Accident

Gleicon Moraes

Agenda

•  Architecture for data - even if you don’t want it
•  Databases
•  Message Queues
•  Cache

Architecture
“Everyone has a plan until they get punched in the mouth” – Mike Tyson

Even if you don't want it ...

•  There is an innate architecture on everything
•  You may end up with more data than you had
planned to
•  You may get away from your quick and dirty CRUD
•  You probably are querying more than one Database
•  At some point you laugh when your boss asks you
about 'Integrating Systems'
•  Code turns into legacy - and so do architectures
•  'Scattered' is not the same as 'Distributed'

It usually starts like this

App Server, Database

then it goes like this and beyond

•  App Servers, Database
•  App Servers, Master DB, Slave DB
•  plus a Cache
•  plus an Indexing Service
•  plus API Servers
•  plus a Load Balancer/Reverse Proxy
•  plus an Auth Service

Problem is...
An architect's first work is apt to be spare and clean. He knows he
doesn't know what he's doing, so he does it carefully and with great
restraint.

As he designs the first work, frill after frill and embellishment after
embellishment occur to him. These get stored away to be used next
time. Sooner or later the first system is finished, and the architect,
with firm confidence and a demonstrated mastery of that class of
systems, is ready to build a second system.

This second is the most dangerous system a man ever designs. When
he does his third and later ones, his prior experiences will confirm each
other as to the general characteristics of such systems, and their
differences will identify those parts of his experience that are particular
and not generalizable.

The general tendency is to over-design the second system, using all
the ideas and frills that were cautiously sidetracked on the first one.
The result, as Ovid says, is a big pile.

— Frederick P. Brooks, Jr.
The Mythical Man-Month

Databases

•  Not an off-the-shelf architectural duct tape
•  Not only relational, other paradigms
•  Usually the last place sought for optimization
•  Usually the first place to accommodate last minute changes
•  Good ideas to try out: Sharding and Denormalization
•  Some of your problems may require something other than a Relational Database

Relevant RDBMS Anti-Patterns

–  Dynamic table creation
–  Table as cache
–  Table as queue
–  Table as log file
–  Distributed Global Locking
–  Stoned Procedures
–  Row Alignment
–  Extreme JOINs
–  Your ORM issues full queries for dataset iterations
–  Throttle Control

Dynamic table creation
Problem: To avoid huge tables, a "dynamic schema" is
created. For example, let's think about a document
management company which is adding new facilities across
the country. For each storage facility, a new table is created:

item_id - row - column - stuff
1 - 10 - 20 - cat food
2 - 12 - 32 - trout

Side Effect: "dynamic queries", which will probably query a
"central storage" table and issue a huge join to check if you
have enough cat food across the country. This is different from
Sharding.

Alternative:
- Document storage, modeling a facility as a document
- Key/Value, modeling each facility as a SET
- Sharding properly
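
As a rough illustration of "sharding properly", the sketch below keeps one fixed item schema and routes each facility to a shard with a stable hash, instead of creating a table per facility. The shard count, the facility names and the in-memory lists standing in for database shards are all assumptions made for the example.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_for(facility_id, num_shards=NUM_SHARDS):
    """Pick a shard with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha1(facility_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# One logical "items" schema; each shard holds rows for many facilities.
shards = [[] for _ in range(NUM_SHARDS)]

def store_item(facility_id, row, column, stuff):
    """Write an item to the shard that owns its facility."""
    shards[shard_for(facility_id)].append(
        {"facility": facility_id, "row": row, "column": column, "stuff": stuff}
    )

def count_stuff(stuff):
    """Country-wide count: the same query runs on every shard, no per-facility tables."""
    return sum(1 for shard in shards for item in shard if item["stuff"] == stuff)

# Hypothetical facilities, reusing the items from the example table.
store_item("facility-sp", 10, 20, "cat food")
store_item("facility-rj", 12, 32, "trout")
store_item("facility-rj", 3, 7, "cat food")
```

The number of tables never changes as facilities are added; only the routing function decides where a row lives.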

Table as cache
Problem: Complex queries demand that a result be
stored in a separate table, so it can be queried
quickly. Worse than views.

Alternative:
- Really ?
- Memcached
- Redis + AOF + EXPIRE
- Denormalization
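
A minimal sketch of the cache alternative, assuming an in-memory stand-in for memcached or Redis with expiration. The class, the `report:summary` key and the 60 second TTL are invented for the example; a real deployment would call the cache server instead.

```python
import time

class TTLCache:
    """Tiny in-memory stand-in for a key/value cache with per-key expiration."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl):
        self._data[key] = (self._clock() + ttl, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazy expiration on read
            return None
        return value

def expensive_report(cache, run_query, ttl=60):
    """Cache-aside: try the cache first, fall back to the complex query."""
    cached = cache.get("report:summary")
    if cached is not None:
        return cached
    result = run_query()
    cache.set("report:summary", result, ttl)
    return result
```

Unlike a result table, nothing here needs to be truncated or refreshed by a job: stale entries simply expire.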

Table as queue
Problem: A table which holds messages to be
completed. Worse, they must be sorted by date.

Alternative:
- RestMQ, Resque
- Any other message broker
- Redis (LISTS - LPUSH + RPOP)
- Use the right tool
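
To show why LPUSH + RPOP gives queue behaviour without any sorting by date, here is a small sketch that mimics those two list operations with a Python deque. It is only a local model of the semantics, not a Redis client, and the job names are made up.

```python
from collections import deque

class ListQueue:
    """Mimics a Redis list used as a queue: LPUSH at the head, RPOP from the tail."""

    def __init__(self):
        self._items = deque()

    def lpush(self, message):
        self._items.appendleft(message)
        return len(self._items)  # LPUSH also reports the new list length

    def rpop(self):
        # RPOP returns nil on an empty list; None plays that role here.
        return self._items.pop() if self._items else None

queue = ListQueue()
queue.lpush("job-1")
queue.lpush("job-2")
queue.lpush("job-3")
first = queue.rpop()  # the oldest message comes out first
```

Insertion order is the delivery order, so no date column or ORDER BY is needed.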

Table as log file
Problem: A table in which data gets written as a log
file. From time to time it needs to be purged.
Truncating this table once a day usually is the first task
assigned to trainee DBAs.

Alternative:
- MongoDB capped collection
- Redis, and an RRD pattern
- Riak
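
The idea behind a capped collection or an RRD-style log can be sketched with a bounded deque: once the log is full, each write pushes out the oldest entry, so there is nothing to purge. The class name and the limit of three entries are assumptions for the example.

```python
from collections import deque

class CappedLog:
    """Fixed-size log: old entries fall off automatically, no purge job needed."""

    def __init__(self, max_entries):
        self._entries = deque(maxlen=max_entries)

    def write(self, line):
        self._entries.append(line)  # silently drops the oldest entry when full

    def entries(self):
        return list(self._entries)

log = CappedLog(max_entries=3)
for i in range(1, 6):
    log.write("event %d" % i)
```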

Distributed Global Locking
Problem: Someone learns Java and synchronized. A bit
later, the genius thinks that a distributed synchronized would
be awesome. The proper place to do that would be the
database, of course. It starts with a reference counter in a
table and ends up with this:

> select COALESCE(GET_LOCK('my_lock',0 ),0 )

Plain and simple, you might find it embedded in a
magic class called DistributedSynchronize or
ClusterSemaphore. Locks, transactions and reference
counters (which may act as soft locks) don't belong
in the database.

Stoned procedures
Problem: Stored procedures hold most of your
application's logic. Also, some triggers are used to - well
- trigger important data events.

SPs and triggers have the magic property of vanishing from
our minds instantly, and of being impossible to keep versioned.

Alternative:
- Be careful not to use map/reduce as stoned
procedures.
- Use your preferred language for business stuff, and
leave event handling to pub/sub or message queues.
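
A minimal in-process sketch of the pub/sub alternative: the business rule lives in application code, and what would have been a trigger becomes a subscriber. The `EventBus` class, the `order.created` topic and the audit handler are all hypothetical; a real system would publish to a broker.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process pub/sub; a message broker would replace this in production."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._subscribers[topic]:
            handler(payload)

bus = EventBus()
audit_trail = []

def record_audit(order):
    """What a trigger would have done, now as versioned application code."""
    audit_trail.append(("order.created", order["id"]))

bus.subscribe("order.created", record_audit)

def create_order(order_id, total):
    order = {"id": order_id, "total": total}
    # A plain INSERT would persist the order here; no trigger is attached to it.
    bus.publish("order.created", order)
    return order

create_order(1, 99.9)
```

Both the rule and its side effects now live in the same repository as the rest of the code.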

Row Alignment
Problem: Extra columns are created but not used, just in
case. Usually they are named a1, a2, a3, a4 and
called padding.

There's good will behind that, especially when version 1
of the software needed an extra column in a 150M-row
database and it took 2 days to run an ALTER TABLE.

Alternative:
- Document based databases such as MongoDB and
CouchDB, where new attributes are local to the
document. Also, having no schema helps

- Column based databases may not be the best choice
if column creation needs restarts/migrations
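
To make "attributes local to the document" concrete, the sketch below uses plain dictionaries as stand-ins for stored documents. The field names are invented; the point is that a new attribute appears only on the documents that need it, with no padding columns and no ALTER TABLE.

```python
# Documents written by version 1 of the software (hypothetical fields).
documents = [
    {"id": 1, "name": "contract.pdf"},
    {"id": 2, "name": "invoice.pdf"},
]

# Version 2 needs a new attribute: only new documents carry it.
documents.append({"id": 3, "name": "scan.png", "resolution_dpi": 300})

def resolution_of(doc, default=None):
    """Readers tolerate documents that predate the new attribute."""
    return doc.get("resolution_dpi", default)
```

The cost moves to the reader, which has to handle documents with and without the attribute.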

Extreme JOINs
Problem: Business rules modeled as tables. Table
inheritance (Product -> SubProduct_A). To find the
complete data for a user plan, one must issue gigantic
queries with lots of JOINs.

Alternative:
- Document storage, such as MongoDB
- Denormalization
- Serialized objects
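
A small sketch of the "serialized objects" option, assuming a key/value store modeled here as a dictionary. The whole user plan is stored as one JSON value, so reading it back is a single key lookup instead of a chain of JOINs. The key format and the plan fields are assumptions for the example.

```python
import json

def save_plan(store, user_id, plan):
    """Store the complete plan as one serialized value."""
    store["plan:%d" % user_id] = json.dumps(plan)

def load_plan(store, user_id):
    """One lookup returns the complete plan; no JOINs involved."""
    raw = store.get("plan:%d" % user_id)
    return json.loads(raw) if raw is not None else None

store = {}  # stand-in for a key/value or document store
save_plan(store, 42, {
    "product": "Product",
    "sub_product": "SubProduct_A",
    "features": ["storage", "support"],
})
plan = load_plan(store, 42)
```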

Your ORM ...
Problem: Your ORM issues full queries for dataset
iterations, and it maps and creates tables which
mimic your classes, even the inheritance. Performance
is bad because the queries are huge, etc.

Alternative:
Apart from denormalization and good old common
sense, remember that ORMs are trying to bridge two
things with distinct impedance.

Nothing in the relational model maps cleanly to
classes and objects, not even the basic unit, which is
the domain (set) of each column. Black magic?

Throttle Control
Problem: A request tracker to create a throttle control by IP
address, login, operation or any other event using a relational
database.

Whether it is done with an update … select or with a lock/transaction
block, a relational database is not the best place to do that.

Alternative: use memcached, redis or any other DHT which has
expiration, by creating a key such as
THROTTLE:<IP>:YYYYMMDDHH and incrementing it. At first
glance it sounds the same, but the expiration will take care of
cleaning up old entries. Also, the search cost is just that of
looking up a key.
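
The throttle key described above can be sketched as follows. The counter class is an in-memory stand-in for increment-with-expiration on memcached or Redis, and the limit of 100 requests per hour and the one hour TTL are assumptions chosen for the example.

```python
from datetime import datetime, timezone

class ExpiringCounters:
    """In-memory stand-in for an increment plus expiration on a key/value store."""

    def __init__(self):
        self._counters = {}  # key -> (expires_at_epoch, count)

    def incr(self, key, ttl_seconds, now_epoch):
        expires_at, count = self._counters.get(key, (now_epoch + ttl_seconds, 0))
        if now_epoch >= expires_at:
            # The real store would already have dropped the key by now.
            expires_at, count = now_epoch + ttl_seconds, 0
        count += 1
        self._counters[key] = (expires_at, count)
        return count

def throttle_key(ip, when):
    """One key per IP per hour, shaped like the key described above."""
    return "THROTTLE:{}:{}".format(ip, when.strftime("%Y%m%d%H"))

def allow_request(counters, ip, when, limit=100):
    """Count the request and report whether the caller is still under the limit."""
    key = throttle_key(ip, when)
    count = counters.incr(key, ttl_seconds=3600, now_epoch=when.timestamp())
    return count <= limit
```

Because the hour is part of the key, a new hour starts a fresh counter and the old one simply expires.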

No silver bullet
- Consider alternatives

- Think outside the norm

- Denormalize

- Simplify

Cycle of changes - Product A
1. There was the database model

2. Then, the cache was needed. Performance was no good.

3. Cache key: query, value: resultset

4. High or nonexistent expiration time [w00t]

(Now there's a turning point. Data didn't need to change often.
Denormalization was a given with cache)

5. The cache needs to be warmed or the app won't work.

6. Key/Value storage was a natural choice. No data on MySQL
anymore.

Cycle of changes - Product B
1. Postgres DB storing crawler results.

2. There was a counter in each row, and updating this counter
caused contention errors.

3. Memcache for reads. Performance is better.

4. First MongoDB test, no more deadlocks from counter
update.

5. Data model was simplified, the entire crawled doc was
stored.

Stuff to think about
Ask yourself whether the data you use isn't already denormalized (cached).

Most of the anti-patterns contain signs that a non-relational
route (or at least a partial route) may help.

Are you dependent on cache? Does your application fail when
there is no cache? Does it just slow down?

Are you ready to think more about your data ?

Think about the way to put and to get back your data from the
database (be it SQL or NoSQL).
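
One way to answer the "fail or just slow down" question is to make the read path treat a cache outage as a miss. The sketch below assumes a hypothetical `CacheUnavailable` exception and caller-supplied `cache_get` and `load_from_db` functions; none of these come from a specific library.

```python
class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache cannot be reached."""

def get_with_fallback(key, cache_get, load_from_db):
    """Treat a cache outage like a miss: the app slows down instead of failing."""
    try:
        value = cache_get(key)
    except CacheUnavailable:
        value = None
    if value is None:
        value = load_from_db(key)
    return value
```

If `load_from_db` cannot exist because the cache has become the only copy of the data, the cache is really the database and should be operated as one.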

Extra - MongoDB and Redis
The next two slides are here to show what it is like to use
MongoDB and Redis for the same task.

There is more to managing your data than stuffing it inside a
database. You have to plan ahead for searches and migrations.

This example is about storing books and searching among
them. MongoDB makes it simpler: you just use its query
language. Redis requires that you keep track of tags and ids and
use SET operations to recover which books you want.

Check http://rediscookbook.org and
http://cookbook.mongodb.org/ for recipes on data handling.

MongoDB/Redis recap - Books
MongoDB

{
  'id': 1,
  'title' : 'Diving into Python',
  'author': 'Mark Pilgrim',
  'tags': ['python','programming', 'computing']
}
{
  'id': 2,
  'title' : 'Programming Erlang',
  'author': 'Joe Armstrong',
  'tags': ['erlang','programming', 'computing', 'distributedcomputing', 'FP']
}
{
  'id': 3,
  'title' : 'Programming in Haskell',
  'author': 'Graham Hutton',
  'tags': ['haskell','programming', 'computing', 'FP']
}

Redis

SET book:1 {'title' : 'Diving into Python', 'author': 'Mark Pilgrim'}
SET book:2 {'title' : 'Programming Erlang', 'author': 'Joe Armstrong'}
SET book:3 {'title' : 'Programming in Haskell', 'author': 'Graham Hutton'}

SADD tag:python 1
SADD tag:erlang 2
SADD tag:haskell 3
SADD tag:programming 1 2 3
SADD tag:computing 1 2 3
SADD tag:distributedcomputing 2
SADD tag:FP 2 3

MongoDB/Redis recap - Books
MongoDB

Search tags for erlang or haskell:

db.books.find({"tags":
  { $in: ['erlang', 'haskell'] }
})

Search tags for erlang AND haskell (no results)

db.books.find({"tags":
  { $all: ['erlang', 'haskell'] }
})

This search yields 3 results

db.books.find({"tags":
  { $all: ['programming', 'computing'] }
})

Redis

SINTER 'tag:erlang' 'tag:haskell'
0 results

SINTER 'tag:programming' 'tag:computing'
3 results: 1, 2, 3

SUNION 'tag:erlang' 'tag:haskell'
2 results: 2 and 3

SDIFF 'tag:programming' 'tag:haskell'
2 results: 1 and 2 (haskell is excluded)
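
The Redis side of the recap can be checked locally with Python sets, which support the same intersection, union and difference operations. This is only a model of the set semantics using the tag data from the slides; the helper names are made up and no Redis client is involved.

```python
# Tag -> set of book ids, mirroring the SADD commands in the recap.
tags = {
    "python": {1},
    "erlang": {2},
    "haskell": {3},
    "programming": {1, 2, 3},
    "computing": {1, 2, 3},
    "distributedcomputing": {2},
    "FP": {2, 3},
}

def sinter(*names):
    """Books that carry every one of the given tags (like SINTER)."""
    return set.intersection(*(tags[name] for name in names))

def sunion(*names):
    """Books that carry at least one of the given tags (like SUNION)."""
    return set.union(*(tags[name] for name in names))

def sdiff(first, *rest):
    """Books tagged `first` but none of `rest` (like SDIFF)."""
    return tags[first].difference(*(tags[name] for name in rest))
```

For example, `sunion("erlang", "haskell")` plays the role of the `$in` query and `sinter("programming", "computing")` the role of the `$all` query.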

Decoupling db writes with Message Queues

Async HTML Scraper

[Diagram: four "Fetch Page / 1st parse" workers feed a Message Queue, which is read by a single Consumer.]
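
A rough sketch of the scraper layout, assuming threads and an in-process queue in place of a real message broker. `fetch_and_parse` is a hypothetical stand-in that does no network access, and a list plays the role of the database. Several fetchers publish parsed pages, and a single consumer is the only code that writes.

```python
import queue
import threading

def fetch_and_parse(url):
    """Hypothetical stand-in for 'Fetch Page' + '1st parse'; no network access."""
    return {"url": url, "title": "title of " + url}

def producer(urls, mq):
    """Fetcher: publishes parsed pages and never touches the database."""
    for url in urls:
        mq.put(fetch_and_parse(url))

def consumer(mq, database):
    """Single writer: the only place where results are stored."""
    while True:
        message = mq.get()
        if message is None:  # sentinel: no more work
            break
        database.append(message)

def run(url_batches):
    mq = queue.Queue()
    database = []  # stand-in for the real database
    writer = threading.Thread(target=consumer, args=(mq, database))
    writer.start()
    fetchers = [threading.Thread(target=producer, args=(batch, mq))
                for batch in url_batches]
    for fetcher in fetchers:
        fetcher.start()
    for fetcher in fetchers:
        fetcher.join()
    mq.put(None)  # every fetcher is done, tell the consumer to stop
    writer.join()
    return database
```

Fetchers can be added or slowed down without changing the write rate the database sees, because the queue absorbs the difference.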

M/R

[Diagram: four "Fetch Data / Map(Fun)" workers feed a Message Queue, which is read by a Reduce step.]

M/R – Wordcount(Map)

M/R – Wordcount(Reduce)
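
The code from these two slides is not part of the text, so the following is a generic word count sketch rather than the original implementation. The map step emits (word, 1) pairs per document and the reduce step sums them; the sample documents are invented, and in the layout above the pairs would travel through the message queue.

```python
from collections import Counter

def map_words(text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["the cat", "The trout and the cat"]  # sample input
pairs = [pair for doc in documents for pair in map_words(doc)]
counts = reduce_counts(pairs)
```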

HTML processing - no cache

http://github.com/gleicon/vuvuzelr/proxy_no_cache.rb

HTML processing - Cached

http://github.com/gleicon/vuvuzelr/proxy.rb

Thanks

•  @gleicon

•  http://www.7co.cc

•  http://github.com/gleicon

•  gleicon@gmail.com
