Groups keyboard shortcuts have been updated
Dismiss
See shortcuts

Foursquare outage post mortem

14,694 views
Skip to first unread message

Eliot Horowitz

unread,
Oct 7, 2010, 6:05:18 PM10/7/10
(Note: this is being posted with Foursquare’s permission.)

As many of you are aware, Foursquare had a significant outage this
week. The outage was caused by capacity problems on one of the
machines hosting the MongoDB database used for check-ins. This is an
account of what happened, why it happened, how it can be prevented,
and how 10gen is working to improve MongoDB in light of this outage.

It’s important to note that throughout this week, 10gen and Foursquare
engineers have been working together very closely to resolve the
issue.

* Some history
Foursquare has been hosting check-ins on a MongoDB database for some
time now. The database was originally running on a single EC2
instance with 66GB of RAM. About 2 months ago, in response to
increased capacity requirements, Foursquare migrated that single
instance to a two-shard cluster. Now, each shard was running on its
own 66GB instance, and both shards were also replicating to a slave
for redundancy. This was an important migration because it allowed
Foursquare to keep all of their check-in data in RAM, which is
essential for maintaining acceptable performance.

The data had been split into 200 evenly distributed chunks based on
user id. The first half went to one server, and the other half to the
other. Each shard had about 33GB of data in RAM at this point, and
the whole system ran smoothly for two months.

* What we missed in the interim
Over these two months, check-ins were being written continually to
each shard. Unfortunately, these check-ins did not grow evenly across
chunks. It’s easy to imagine how this might happen: assuming certain
subsets of users are more active than others, it’s conceivable that
their updates might all go to the same shard. That’s what occurred in
this case, resulting in one shard growing to 66GB and the other only
to 50GB. [1]

* What went wrong
On Monday morning, the data on one shard (we’ll call it shard0)
finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
machine. Whenever data size grows beyond physical RAM, it becomes
necessary to read and write to disk, which is orders of magnitude
slower than reading and writing RAM. Thus, certain queries started to
become very slow, and this caused a backlog that brought the site
down.

We first attempted to fix the problem by adding a third shard. We
brought the third shard up and started migrating chunks. Queries were
now being distributed to all three shards, but shard0 continued to hit
disk very heavily. When this failed to correct itself, we ultimately
discovered that the problem was due to data fragmentation on shard0.
In essence, although we had moved 5% of the data from shard0 to the
new third shard, the data files, in their fragmented state, still
needed the same amount of RAM. This can be explained by the fact that
Foursquare check-in documents are small (around 300 bytes each), so
many of them can fit on a 4KB page. Removing 5% of these just made
each page a little more sparse, rather than removing pages
altogether.[2]

After the first day's outage it had become clear that chunk migration,
sans compaction, was not going to solve the immediate problem. By the
time the second day's outage occurred, we had already move 5% of the
data off of shard0, so we decided to start an offline process to
compact the database using MongoDB’s repairDatabase() feature. This
process took about 4 hours (partly due to the data size, and partly
because of the slowness of EBS volumes at the time). At the end of
the 4 hours, the RAM requirements for shard0 had in fact been reduced
by 5%, allowing us to bring the system back online.

* Afterwards
Since repairing shard0 and adding a third shard, we’ve set up even
more shards, and now the check-in data is evenly distributed and there
is a good deal of extra capacity. Still, we had to address the
fragmentation problem. We ran a repairDatabase() on the slaves, and
promoted the slaves to masters, further reducing the RAM needed on
each shard to about 20GB.

* How is this issue triggered?
Several conditions need to be met to trigger the issue that brought
down Foursquare:
1. Systems are at or over capacity. How capacity is defined varies; in
the case of Foursquare, all data needed to fit into RAM for acceptable
performance. Other deployments may not have such strict RAM
requirements.
2. Document size is less than 4k. Such documents, when moved, may be
too small to free up pages and, thus, memory.
3. Shard key order and insertion order are different. This prevents
data from being moved in contiguous chunks.

Most sharded deployments will not meet these criteria. Anyone whose
documents are larger than 4KB will not suffer significant
fragmentation because the pages that aren’t being used won’t be
cached.

* Prevention
The main thing to remember here is that once you’re at max capacity,
it’s difficult to add more capacity without some downtime when objects
are small. However, if caught in advance, adding more shards on a
live system can be done with no downtime.

For example, if we had notifications in place to alert us 12 hours
earlier that we needed more capacity, we could have added a third
shard, migrated data, and then compacted the slaves.

Another salient point: when you’re operating at or near capacity,
realize that if things get slow at your hosting provider, you may find
yourself all of a sudden effectively over capacity.

* Final Thoughts
The 10gen tech team is working hard to correct the issues exposed by
this outage. We will continue to work as hard as possible to ensure
that everyone using MongoDB has the best possible experience. We are
thankful for the support that we have received from Foursquare and our
community during this unfortunate episode. As always, please let us
know if you have any questions or concerns.

[1] Chunks get split when they are 200MB into 2 100MB halves. This
means that even if the number of chunks on each shard was the same,
data size is not always so. This is something we are going to be
addressing in MongoDB. We'll be making splitting balancing look for
this imbalance so it can act upon it.

[2] The 10gen team is working on doing online incremental compaction
of both data files and indexes. We know this has been a concern in
non-sharded systems as well. More details about this will be coming
in the next few weeks.

harryh

unread,
Oct 7, 2010, 7:32:31 PM10/7/10
to mongodb-user
Just as a quick follow-up to Eliot's detailed message on our outage I
want to make it clear to the community how helpful Eliot and the rest
of the gang at 10gen has been to us in the last couple days. We've
been in near constant communication with them as we worked to repair
things and end the outage, and then to identify and get to work on
fixing things so that this sort of thing won't happen again.

They've been a huge help to us, and things would have been much worse
without them.

Further, I am very confident that some of the issues that we've
identified in Mongo will be dealt with so others don't encounter
similar problems. Overall we still remain huge fans of MongoDB @
foursquare, and expect to be using it for a long time to come.

I'm more than happy to help answer any concerns that others might have
about MongoDB as it relates to this incident.

-harryh, eng lead @ foursquare

Nat

unread,
Oct 7, 2010, 7:47:29 PM10/7/10
to mongodb-user
Thanks for very insightful post-mortem. I have a couple more questions
to ask:
- Can repairDatabase() leverage multiple cores? Given that data is
broken down into chunks, can we process them in parallel? 4 hours
downtime seems to be eternal in internet space.
- It seems that when data grows out of memory, the performance seems
to degrade significantly. If only part of the data pages are touched
(because only some users are active), why the paging/caching mechanism
is not effective and cause the disk read/write rates to be high and
bog down the request processing?
- For shard key selection, it looks like if insertion order determines
shard key, it will cause an uneven load on a machine for high insert.
If not, it may cause some problem when data is moved. What should be
the best practices? Does it depend on type of load?
- Is there some way to identify and alert that additional shards/
machines are needed by looking at the statistics information? It would
be very helpful to prevent this kind of problem. We have a similar
situation as well. The performance is not degraded gradually. When
something hits, it hits really hard and we have to take action
immediately but additional machines may not be available immediately.

Thanks,

Alex Popescu

unread,
Oct 7, 2010, 8:01:29 PM10/7/10
to mongodb-user
Firstly, thanks a lot to both Foursquare and 10gen engineers/people
for sharing these details.

While not a MongoDB expert as most of the people involved here, I do
feel that there are quite a few open questions.
I have summarized them (and put them in context)
http://nosql.mypopescu.com/post/1265191137/foursquare-mongodb-outage-post-mortem,
but for quick reference:

1. were writes configured to work with MongoDB's fire and forget
behavior?

2. were replicas used for distributing reads?

3. could you bring up read-only replicas to deal with the increased
read access time?

4. why the new shard could take only 5% of the data?

5. is there a real solution to the chunk migration/page size issue?

6. how could one plan for growth when sharding is based on a key that
doesn't follow insertion order?

thanks again,

:- alex

Marc Esher

unread,
Oct 7, 2010, 8:14:14 PM10/7/10
to mongodb-user
I'm curious if foursquare had other options for shard key. Can anyone
speak to whether, if they were starting from scratch, they have better
choices now for choosing shard keys?

On Oct 7, 8:01 pm, Alex Popescu <[email protected]>
wrote:
> Firstly, thanks a lot to both Foursquare and 10gen engineers/people
> for sharing these details.
>
> While not a MongoDB expert as most of the people involved here, I do
> feel that there are quite a few open questions.
> I have summarized them (and put them in context)http://nosql.mypopescu.com/post/1265191137/foursquare-mongodb-outage-...,

Eliot Horowitz

unread,
Oct 7, 2010, 8:16:27 PM10/7/10
> - Can repairDatabase() leverage multiple cores? Given that data is
> broken down into chunks, can we process them in parallel? 4 hours
> downtime seems to be eternal in internet space.

Currently no. We're going to be working on doing compaction in the
background though so you won't have to do this large offline
compaction at all.

> -  It seems that when data grows out of memory, the performance seems
> to degrade significantly. If only part of the data pages are touched
> (because only some users are active), why the paging/caching mechanism
> is not effective and cause the disk read/write rates to be high and
> bog down the request processing?

Foursquare needed read access to more blocks per second than even a 4
drive raid could keep up with.
This is not normal for most sites, as many MongoDB installations have
storage sizes 2x or even up to 100x ram.
MongoDB should be better about handling situations like this and
degrade much more gracefully.
We'll be working on these enhancements soon as well.

> - For shard key selection, it looks like if insertion order determines
> shard key, it will cause an uneven load on a machine for high insert.
> If not, it may cause some problem when data is moved. What should be
> the best practices? Does it depend on type of load?

Its very hard to give a general answer. As you said one based on
insertion order means 1 shard takes all the writes.
In some cases that's fine. If total data size is massive, but insert
load is modest, then that might be a great solution.
If you need to distribue writes, then you might fall into this other
issue. We're working on compaction so it won't actually be an issue,
but in the meantime you can follow the instructions above to add
shards.

> - Is there some way to identify and alert that additional shards/
> machines are needed by looking at the statistics information? It would
> be very helpful to prevent this kind of problem. We have a similar
> situation as well. The performance is not degraded gradually. When
> something hits, it hits really hard and we have to take action
> immediately but additional machines may not be available immediately.

First you have to understand what resources your system cares most about.
It can be ram, disk or cpu. If its ram or cpu, you can use the mongo
db.stats() command to see total requirements.

> --
> You received this message because you are subscribed to the Google Groups "mongodb-user" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at http://groups.google.com/group/mongodb-user?hl=en.
>
>

Eliot Horowitz

unread,
Oct 7, 2010, 8:19:28 PM10/7/10
> 1. were writes configured to work with MongoDB's fire and forget
> behavior?

I'm not actually sure - but does not really impact anything else.

> 2. were replicas used for distributing reads?

No - all reads needed to be consistent.

> 3. could you bring up read-only replicas to deal with the increased
> read access time?

The read load is so heavy - that even double the number of disks
wouldn't have helped.

> 4. why the new shard could take only 5% of the data?

It could take more, and in fact it now has 50%, we were trying to move
as little as possible to get the site back up as fast as possible.

> 5. is there a real solution to the chunk migration/page size issue?

Yes - we are working on online compaction would mitigate this.

> 6. how could one plan for growth when sharding is based on a key that
> doesn't follow insertion order?

In the post we describe a method for adding shards without any
downtime in that case.
Briefly, its add shard, let balancing move data, compact slaves, swap
master and slave.


>
> On Oct 8, 2:32 am, harryh <[email protected]> wrote:
>> Just as a quick follow-up to Eliot's detailed message on our outage I
>> want to make it clear to the community how helpful Eliot and the rest
>> of the gang at 10gen has been to us in the last couple days.  We've
>> been in near constant communication with them as we worked to repair
>> things and end the outage, and then to identify and get to work on
>> fixing things so that this sort of thing won't happen again.
>>
>> They've been a huge help to us, and things would have been much worse
>> without them.
>>
>> Further, I am very confident that some of the issues that we've
>> identified in Mongo will be dealt with so others don't encounter
>> similar problems.  Overall we still remain huge fans of MongoDB @
>> foursquare, and expect to be using it for a long time to come.
>>
>> I'm more than happy to help answer any concerns that others might have
>> about MongoDB as it relates to this incident.
>>
>> -harryh, eng lead @ foursquare
>

Eliot Horowitz

unread,
Oct 7, 2010, 8:21:21 PM10/7/10
The only shard key that might have helped was one based on insertion order.
The problem with that is then loading data for a single user would hit
multiple machines.
While workable, it's a lot cleaner to shard based on uid.
The choice of shard key is not wrong, MongoDB just needs to do online
compaction so migrates reduce the number of active pages.

Suhail Doshi

unread,
Oct 7, 2010, 8:16:42 PM10/7/10
to mongodb-user
I think the most obvious question is: How do we avoid maxing out our
mongodb nodes and know when to provision new nodes?

With the assumptions that: you're monitoring everything.

What do we need to look at? How do we tell? What happens if you're a
company with lots of changing scaling needs and features...it's tough
to capacity plan when you're a startup.

One solution would've been to memory limit how much mongo uses but I
don't think this is possible given Mongo uses mmpa'd files right now.
Merely watching IO and Swap I don't think is a solution since it may
be too late at that point.

Suhail

Marc Esher

unread,
Oct 7, 2010, 8:52:13 PM10/7/10
to mongodb-user
I'm curious if foursquare had other options for shard key. Can anyone
speak to whether, if they were starting from scratch, they have better
choices now for choosing shard keys?

On Oct 7, 8:01 pm, Alex Popescu <[email protected]>
wrote:
> Firstly, thanks a lot to both Foursquare and 10gen engineers/people
> for sharing these details.
>
> While not a MongoDB expert as most of the people involved here, I do
> feel that there are quite a few open questions.
> I have summarized them (and put them in context)http://nosql.mypopescu.com/post/1265191137/foursquare-mongodb-outage-...,

MattK

unread,
Oct 7, 2010, 8:54:14 PM10/7/10
to mongodb-user
On Oct 7, 6:05 pm, Eliot Horowitz <[email protected]> wrote:
> 1. Systems are at or over capacity. How capacity is defined varies; in
> the case of Foursquare, all data needed to fit into RAM for acceptable
> performance. Other deployments may not have such strict RAM
> requirements.

How does one determine what the RAM requirements are? For databases
that extend into the TB range, the same quantity of RAM is not an
option.

Is the MongoDB formula significantly different from the traditional
RDBMS, where performance is best when RAM buffers are large enough to
contain most commonly accessed / MRU blocks?

How much of Foursquare's problem is related to the lower IO rates of
EC2/EBS compared to fast, local, dedicated disks?

Eliot Horowitz

unread,
Oct 7, 2010, 8:57:35 PM10/7/10
See below...

> I think the most obvious question is: How do we avoid maxing out our
> mongodb nodes and know when to provision new nodes?
> With the assumptions that: you're monitoring everything.
>
> What do we need to look at? How do we tell? What happens if you're a
> company with lots of changing scaling needs and features...it's tough
> to capacity plan when you're a startup.

This is very application dependent. In some cases you need all your
indexes in ram, in other cases it can just be a small working set.
One good way to determine this is figure out how much data you
determine in a 10 minute period. Indexes and documents. Make sure
you can either store that in ram or read that from disk in the same
amount of time.

> One solution would've been to memory limit how much mongo uses but I
> don't think this is possible given Mongo uses mmpa'd files right now.
> Merely watching IO and Swap I don't think is a solution since it may
> be too late at that point.

This wouldn't really help. Generally the more ram you have the
better. The memory MongoDB uses is to cache data, so the more you
give it the better.

Eliot Horowitz

unread,
Oct 7, 2010, 8:59:45 PM10/7/10
> How does one determine what the RAM requirements are? For databases
> that extend into the TB range, the same quantity of RAM is not an
> option.

You need to determine working set. See recent comment or 2 above this one.

> Is the MongoDB formula significantly different from the traditional
> RDBMS, where performance is best when RAM buffers are large enough to
> contain most commonly accessed / MRU blocks?

Its really the same in MongoDB as a traditional RDBMs.

> How much of Foursquare's problem is related to the lower IO rates of
> EC2/EBS compared to fast, local, dedicated disks?

I don't think even fast local disks would have prevented this. There
is a chance SSDs might have, but we have not done all the math to know
for sure.
Faster disks would probably have made it easier to manipulate things
to reduce downtime howerver.

Nathan Folkman (Foursquare)

unread,
Oct 7, 2010, 8:58:30 PM10/7/10
to mongodb-user
One metric to keep an eye on is the database's and/or collection's
storageSize + the totalIndexSize. Also increased read rates can tip
you off that there might be a problem. Last suggestion is to get
comfortable with the mongostats tool - it's invaluable!

Nathan Folkman (Foursquare)

unread,
Oct 7, 2010, 9:03:09 PM10/7/10
to mongodb-user
In our case we found that the EBS read rates were not sufficient if
the data + indexes didn't fit into RAM. I'd caution that this is
specific to how we're accessing this collection, and to our read/write
rates. We're running with 4 EBS volumes (RAID-0) with a small bit of
tuning, which definitely helps.

Roger Binns

unread,
Oct 8, 2010, 12:52:39 AM10/8/10
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/07/2010 04:47 PM, Nat wrote:
> - It seems that when data grows out of memory, the performance seems
> to degrade significantly.

It is when the working set grows out of memory that you have a problem not
"data". Note that this is a problem with any kind of system and not
specific to MongoDB.

What happens when running out of memory (ie more operations proceed at disk
speeds) is that things will get a lot worse because of concurrency. If a
query now takes a second instead of 100ms then it is far more likely another
query will come in during that extra time. Both queries will now be
competing for the disk. Then a third will come along making things even
worse again. The disk seek times really hurt.

What MongoDB needs to do is decrease concurrency under load so that existing
queries can complete before allowing new ones to add fuel to the fire.

There is a ticket for this:

http://jira.mongodb.org/browse/SERVER-574

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkyuo5IACgkQmOOfHg372QRgAwCeILzjvefjjWOqz38ur0ENy0Z6
l6AAnjM/E392E5YtGd7c6t9FozITIveS
=EWrw
-----END PGP SIGNATURE-----

Alberto

unread,
Oct 8, 2010, 3:29:38 AM10/8/10
to mongodb-user
Would be possible to get a tool that could gather info / stats for a
few hours / days and do some suggestions.

Maybe it could suggest shard keys, suggest memory size, new indexes,
etc...


I'm sure that a few of us do wrong querys, and it could even warn of
using group instead of map_reduce if we have less than 10000 keys,
etc... ( Good practices )

mongonix

unread,
Oct 8, 2010, 4:03:22 AM10/8/10
to mongodb-user
Hi,

Just a quick idea:

Claim: The main problem was the fact that fragmentation was happening
under very specific unexpected conditions. And even though the system
was running for months, it was not foreseen or expected.

Counter-action: Try to simulate the production system to get an idea
about different kind of problems in advance!

Therefore, ask your biggest/most important customers about their
typical ways of using MongoDB.
Typical document sizes, number of queries per time unit, read/write
ratios, kind of queries, clustering and sharding setup, etc. May be
even for examples of real data.

Then try to simulate this setup at your own premises using the same/
similar characteristics as your customers do. Do some sort of
randomized/non-deterministic testing. Try to see what would happen to
the system, if major factors are changed, e.g. size of documents,
number of requests, network speed, etc. This way, many kinds of
problems can be detected in this simulated setup before they happen
for real in production.

Does it make sense for you?

- Leo

P.S. Of course, this approach does not apply only to MongoDB. It can
be used almost for any system, where simulation is possible.

Markus Gattol

unread,
Oct 8, 2010, 6:30:31 AM10/8/10

[skipping a lot of lines ...]

MattK> How does one determine what the RAM requirements are? For
MattK> databases that extend into the TB range, the same quantity of
MattK> RAM is not an option.

As already mentioned, every use case is different; what is important is
that you have enough RAM to house your working set and indexes

http://www.markus-gattol.name/ws/mongodb.html#how_much_ram_does_mongodb_need

MattK> Is the MongoDB formula significantly different from the
MattK> traditional RDBMS, where performance is best when RAM buffers
MattK> are large enough to contain most commonly accessed / MRU blocks?

The formula is not different but the impact in both directions (enough
vs not enough RAM to hold the working set and indices) carries more
weight i.e. you will usually (with default configurations) see MongoDB
being faster if there is enough RAM but worse (compared to RDBMs) if
MongoDB needs to hit the disk.

MattK> How much of Foursquare's problem is related to the lower IO
MattK> rates of EC2/EBS compared to fast, local, dedicated disks?

If you need to go to disks for I/O except the default fsync etc. then
you already have a problem. The best solution to mitigate the pain then
however is something that does random I/O i.e. SSD (Solid State Drive)
rather than usual HDDs.

http://www.markus-gattol.name/ws/mongodb.html#speed_impact_of_not_having_enough_ram


Dinesh

unread,
Oct 8, 2010, 1:25:34 AM10/8/10
to mongodb-user
Hi Eliot,

Your post was very informative. One question that remains (and I cant
believe that nobody has raised it yet) is - Why was there no
monitoring system in place to fore-warn your team?

Isn't this a basic requirement? I mean you dont expect the data to
grow beyond your RAM is a very risky constraint. So there should've
been *some* sort of monitoring in place to warn you that you're about
to run out of resources. Its a very basic requirement according to me?
Something like "hey, i'm doing malloc(). Let me check if theres
ACTUALLY memory available for me.". If such a monitoring system was in
place and hooked up to your foursquare app, users would've gotten
error messages while checking in, BUT their existing check-ins
would've been available and wouldn't have led to the whole site/
service going down.

Dinesh

Markus Gattol

unread,
Oct 8, 2010, 8:47:15 AM10/8/10

[skipping a lot of lines ...]

Dinesh> I mean you dont expect the data to grow beyond your RAM is a
Dinesh> very risky constraint.

I would disagree with that. The fact is that the optimal situation is
that all you data set including indices fit into RAM. However, this will
only be true for small sites using MongoDB as data tier in some sort.

The key concern is whether or not the working set fits into RAM at all
times. This pattern is different per use case and situation (slashdot
effect). With sites as foursquare it will always be so that your data
set is bigger than your actual working set and certainly also available
RAM. Monitoring whether or not your data set is growing bigger than RAM
therefore does not make sense. You as a technician need to know about
your use case and plan/develop accordingly as MongoDB can not free you
from those duties.

Dinesh> So there should've been *some* sort of monitoring in place to
Dinesh> warn you that you're about to run out of resources.

There are statistics available you can use
http://www.markus-gattol.name/ws/mongodb.html#server_status

See also the links for the munin and nagios plugins from this link.
Bottom line is, the information is manifold and available, it is up the
person who runs a setup to interpret it and act/plan accordingly
(rationale see above).

Eliot Horowitz

unread,
Oct 8, 2010, 10:01:52 AM10/8/10
Knowing what the actual limit can be tricky.
There was monitoring in place, but the limit for when to add capacity
wasn't quite right.

Jonathan Ultis

unread,
Oct 8, 2010, 12:15:12 PM10/8/10
to mongodb-user
That is why setting a limit on the RAM used by MongoDB would be
useful. If it were possible to set a limit, they could have limited
the machine to 64GB resident. Then, when the system started to fall
over, they could stop MongoDB, bump the limit to 66GB, then turn
MongoDB back on.

Placing a limit on the memory used by MongoDB artificially reduces the
overall system capacity. But, it creates a more reliable early warning
with a fast path to a temporary fix. The temporary fix will usually
last long enough to get new shards online.

Brainstorming.... one could allocate memory in another process, mlock
it so that it can't be swapped out, and then just let it sit in an
infinite sleep. If the server starts to fall over due to lack of
memory, kill the process with the mlocked memory, and boom. You have
runway.

-Jonathan

David Birdsong

unread,
Oct 8, 2010, 1:14:29 PM10/8/10
On Fri, Oct 8, 2010 at 9:15 AM, Jonathan Ultis <[email protected]> wrote:
> That is why setting a limit on the RAM used by MongoDB would be
> useful. If it were possible to set a limit, they could have limited
> the machine to 64GB resident. Then, when the system started to fall
> over, they could stop MongoDB, bump the limit to 66GB, then turn
> MongoDB back on.
>
AFAIK, Mongo isn't malloc'ing memory. you can mmap 100x physical RAM
on the box. it isn't resident memory; it's all virtual. If mongo
dirties pages beyond the configured amount that the system allows to
be dirty, then the vm manager will flush pages to the backnig mmap
file. Usually that configured amount is derived via percentages of
the system RAM. In foursquare's case, they were dirtying pages above
that threshold and the debt owed to IOwait piled up. I'm not negating
your later idea though.

> Placing a limit on the memory used by MongoDB artificially reduces the
> overall system capacity. But, it creates a more reliable early warning
> with a fast path to a temporary fix. The temporary fix will usually
> last long enough to get new shards online.

Theoretically you could lower that aforementioned dirty percentage
using the vm tuning tools. In linux, /proc/sys/vm/... If the working
set is diryting pages beyond the thresholds, then the os will start
flushing to disk and mongo will get slow, you could simply raise the
vm thresholds and get this temporary runway.

diptamay

unread,
Oct 8, 2010, 1:16:15 PM10/8/10
to mongodb-user
MongoDB uses range based partitioning of its shards. So even if some
power users were involved in doing loads of check-ins, what I don't
understand is why didn't MongoDB sharding split the user ranges
further and swap out the chunks and migrate to the other shard. Any
thoughts?

On Oct 7, 6:05 pm, Eliot Horowitz <[email protected]> wrote:

Eliot Horowitz

unread,
Oct 8, 2010, 1:19:53 PM10/8/10
diptamay: see footnote 1.

Markus Gattol

unread,
Oct 8, 2010, 2:03:13 PM10/8/10
to mongodb-user
On Oct 8, 6:16 pm, diptamay <[email protected]> wrote:
> MongoDB uses range based partitioning of its shards. So even if some
> power users were involved in doing loads of check-ins, what I don't
> understand is why didn't MongoDB sharding split the user ranges
> further and swap out the chunks and migrate to the other shard. Any
> thoughts?

Read the "What we missed in the interim" again and also [1] which is
referenced from it. In a nutshell: even if you have the same number of
chunks on each shard, that does not mean that the data set size per
shard is the same on each shard. This lead to a situation where one
shard didn't have enough RAM left which then started a vicious
cycle ...

Jonathan Ultis

unread,
Oct 8, 2010, 2:37:26 PM10/8/10
to mongodb-user
I like that thought. Yeah, mlocking memory will only provide a runway
for read throughput problems due to the working set size exceeding
memory. If the problem was write throughput...

Let's see, increasing the dirty_ratio would let the OS hold more dirty
pages in memory. That won't change the total write throughput by
itself. But, it might allow more edits to land on already dirty pages.
If the edits end up merging, then that would reduce the amount of
write throughput needed. Hopefully. Is that the idea?

Are you sure the root problem was write throughput? That doesn't fit
entirely with the explanation of why adding a new shard didn't fix the
problem. Moving blocks to a new machine should reduce the write rate
on the original shard immediately, even if there is fragmentation that
prevents the working set size from shrinking.

Isn't it more likely that the working set size for read exceeded
memory, causing reads to hit disk sometimes? The new read load was big
enough that moving a percentage of the write load onto a new shard did
not fix the IO contention on the original machine. Increasing the
available RAM would have removed the read IO entirely, fixing the
problem and creating a runway.


On Oct 8, 10:14 am, David Birdsong <[email protected]> wrote:

David Birdsong

unread,
Oct 8, 2010, 6:23:38 PM10/8/10
On Fri, Oct 8, 2010 at 11:37 AM, Jonathan Ultis
<[email protected]> wrote:
> I like that thought. Yeah, mlocking memory will only provide a runway
> for read throughput problems due to the working set size exceeding
> memory. If the problem was write throughput...
>
> Let's see, increasing the dirty_ratio would let the OS hold more dirty
> pages in memory. That won't change the total write throughput by
> itself. But, it might allow more edits to land on already dirty pages.
> If the edits end up merging, then that would reduce the amount of
> write throughput needed. Hopefully. Is that the idea?
sounds good to me.

>
> Are you sure the root problem was write throughput? That doesn't fit
> entirely with the explanation of why adding a new shard didn't fix the
> problem. Moving blocks to a new machine should reduce the write rate
> on the original shard immediately, even if there is fragmentation that
> prevents the working set size from shrinking.
>
> Isn't it more likely that the working set size for read exceeded
> memory, causing reads to hit disk sometimes? The new read load was big
> enough that moving a percentage of the write load onto a new shard did
> not fix the IO contention on the original machine. Increasing the
> available RAM would have removed the read IO entirely, fixing the
> problem and creating a runway.
>

very possibly yes. if i understand the VM, linux at least, reading
all over an mmap'd space does count toward the dirty page ratio, but
perhaps are bounded by all of available RAM (swap too?). if one reads
beyond what's available, then other pages are evicted. i can't seem
to find any reference to describe which pages are chosen for eviction,
perhaps LRU -that's just a guess.

to provide the runway you proposed, using the dirty page settings
would not help in the case of heavily randomized reads i guess.
another way to do the mlock safe guard is to set the the
min_free_kbyes setting to include this runway and simply remove that
runway amount when things get hairy enough.

i know ssd drives were already brought up, but if the problem was in
fact random reads and not necessarily writes, ssd's could have helped.
my guess is that it was probably both.

Ethan Whitt

unread,
Oct 8, 2010, 8:50:30 PM10/8/10
Fusion-IO is another option

http://www.fusionio.com/products/

Ron Bodkin

unread,
Oct 11, 2010, 12:05:45 PM10/11/10
to mongodb-user
I've enjoyed reading the thread. Like Alex I'm not a MongoDB expert,
but am trying to understand what happened better and what can be done
differently, I'd love to know:

1) It sounds like all the database is in the working set. I would have
guessed that only a small fraction of all historical check-ins are
likely to be read (<1%), and that old check-ins are likely to be
clustered on pages that aren't read often, so that is surprising to
me. Can you say what fraction of the objects are likely to be read in
a given hour and what the turnover of working set objects is?

2) Later in the thread Eliot noted:
MongoDB should be better about handling situations like this and
degrade much more gracefully.
We'll be working on these enhancements soon as well.

What kind of enhancements are you planning? Better VM management? Re-
custering objects to push inactive objects into pages that are on
disk? A paging scheme like Redis uses (as described in
http://antirez.com/post/what-is-wrong-with-2006-programming.html)?

3) From the description, it seems like you should be able to monitor
the resident memory used by the mongo db process to get an alert as
shard memory ran low. Is that viable? If not, what can be done to
identify cases where the working set is approaching available? It
seems like monitoring memory/working sets for mongo db instance would
be a generally useful facility - are there plans to add this
capability? What's the best practice today?

4) With respect to using SSD's - what is the write pattern for pages
that get evicted? If they are written randomly using OS paging, I
wouldn't expect there to be a benefit from SSD's (you wouldn't be able
to evict pages fast enough), although if mongodb were able to evict
larger chunks of pages from RAM so you had fewer bigger random writes,
with lots of smaller random reads, that could be a big win.

Thanks,
Ron

Eliot Horowitz

unread,
Oct 11, 2010, 6:05:59 PM10/11/10
Answers below

> 1) It sounds like all the database is in the working set. I would have
> guessed that only a small fraction of all historical check-ins are
> likely to be read (<1%), and that old check-ins are likely to be
> clustered on pages that aren't read often, so that is surprising to
> me. Can you say what fraction of the objects are likely to be read in
> a given hour and what the turnover of working set objects is?

I can't give a lot of details here, but a very high percentage of
documents were touched very often.
Much higher than you would expect.

> 2) Later in the thread Eliot noted:
> MongoDB should be better about handling situations like this and
> degrade much more gracefully.
> We'll be working on these enhancements soon as well.
>
> What kind of enhancements are you planning? Better VM management? Re-
> custering objects to push inactive objects into pages that are on
> disk? A paging scheme like Redis uses (as described in
> http://antirez.com/post/what-is-wrong-with-2006-programming.html)?

The big issue really is concurrency. The VM side works well, the
problem is a read-write lock is too coarse.
1 thread that casues a fault can a bigger slowdown than it should be able to.
We'll be addressing this in a few ways: making yielding more
intelligent, real intra-collection concurrency.

> 3) From the description, it seems like you should be able to monitor
> the resident memory used by the mongo db process to get an alert as
> shard memory ran low. Is that viable? If not, what can be done to
> identify cases where the working set is approaching available? It
> seems like monitoring memory/working sets for mongo db instance would
> be a generally useful facility - are there plans to add this
> capability? What's the best practice today?

The key metric to look at is disk operations per second. If this
starts trending up, that's a good warning side.
Memory is a bit hard because some things don't have to be in ram
(transaction log) but will be if nothing else is using it.

> 4) With respect to using SSD's - what is the write pattern for pages
> that get evicted? If they are written randomly using OS paging, I
> wouldn't expect there to be a benefit from SSD's (you wouldn't be able
> to evict pages fast enough), although if mongodb were able to evict
> larger chunks of pages from RAM so you had fewer bigger random writes,
> with lots of smaller random reads, that could be a big win.

Not sure I follow. The random writes/reads are why we think SSDs are
so good (and in our testing).

-Eliot

> Thanks,
> Ron
>
> On Oct 7, 7:32 pm, harryh <[email protected]> wrote:
>> Just as a quick follow-up to Eliot's detailed message on our outage I
>> want to make it clear to the community how helpful Eliot and the rest
>> of the gang at 10gen has been to us in the last couple days.  We've
>> been in near constant communication with them as we worked to repair
>> things and end the outage, and then to identify and get to work on
>> fixing things so that this sort of thing won't happen again.
>>
>> They've been a huge help to us, and things would have been much worse
>> without them.
>>
>> Further, I am very confident that some of the issues that we've
>> identified in Mongo will be dealt with so others don't encounter
>> similar problems.  Overall we still remain huge fans of MongoDB @
>> foursquare, and expect to be using it for a long time to come.
>>
>> I'm more than happy to help answer any concerns that others might have
>> about MongoDB as it relates to this incident.
>>
>> -harryh, eng lead @ foursquare
>

Ron Bodkin

unread,
Oct 14, 2010, 11:10:06 PM10/14/10
to mongodb-user
I wanted to follow up on one point here:

On Oct 11, 6:05 pm, Eliot Horowitz <[email protected]> wrote:
> > 4) With respect to using SSD's - what is the write pattern for pages
> > that get evicted? If they are written randomly using OS paging, I
> > wouldn't expect there to be a benefit from SSD's (you wouldn't be able
> > to evict pages fast enough), although if mongodb were able to evict
> > larger chunks of pages from RAM so you had fewer bigger random writes,
> > with lots of smaller random reads, that could be a big win.
>
> Not sure I follow.  The random writes/reads are why we think SSDs are
> so good (and in our testing).

SSD's generally have equally poor random writes compared to spinning
disk - it's only for random reads where they have a latency advantage.
If VM paging is resulting in lots of random writes then I wouldn't
expect SSD's to help. But if there a lot more random reads, it could
help a lot (e.g., if the system can just discard pages that are stale
but have no modifications, rather than writing them to disk through
paging).

Ron

Dwight Merriman

unread,
Oct 17, 2010, 5:30:32 PM10/17/10
perhaps use a log structured file system like jffs2?

Reply all
Reply to author
Forward
0 new messages