Logging yourself to death

One of our clients had opened a support incident. Under certain conditions, RavenDB will issue so many I/O requests that it effectively locks up the I/O system. The condition was a cold boot when you have many hundreds of databases on the server and you force all of them to start immediately. We applied a policy to rate limit the number of concurrently loading databases, sent a hotfix, and were done.
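The fix itself was a straightforward throttle. As a rough sketch of the idea (illustrative names and limit, not RavenDB’s actual code), database loading can be gated behind a semaphore so only a handful of databases start at once:

using System;
using System.Threading;
using System.Threading.Tasks;

public class ThrottledDatabaseLoader
{
    // Illustrative limit; the real value and mechanism are RavenDB internals.
    private static readonly SemaphoreSlim ConcurrentLoads = new SemaphoreSlim(8);

    public static async Task LoadDatabaseAsync(string name)
    {
        await ConcurrentLoads.WaitAsync();
        try
        {
            // The actual database startup work (opening files, recovery, etc.)
            // would happen here; Task.Delay just stands in for it.
            await Task.Delay(100);
            Console.WriteLine($"Loaded {name}");
        }
        finally
        {
            ConcurrentLoads.Release();
        }
    }
}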

Everyone was happy, there were birds singing, babies chasing puppies and long walks on the beach.

Then the customer called back: “ever since we applied the hotfix, everything became slower”. A code review of the changes by two different team members, neither of whom was involved in the original hotfix, produced the same conclusion: the affected code was only on the database load code path, and there was no possible way it could impact requests. Attempts to reproduce this locally were unsuccessful.

Working with the customer, we managed to figure out that the problem was related to high CPU usage, which also ruled out the hotfix: a problem there would have shown up as slowness from waiting on a lock, not as high CPU. Digging further, we discovered that most of the time was spent in GC work. That gave us a good handle to start digging. We discovered that on the production system, RavenDB was allocating roughly 4GB / second.

That is… a lot of memory to allocate. And the size of the Gen2 heap was really high.
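To give a sense of scale, here is a hedged sketch of how you might watch for that kind of allocation storm on a modern runtime (GC.GetTotalAllocatedBytes needs .NET Core 3.0 or later, so this is not what we ran on the customer’s machine, just an illustration):

using System;
using System.Threading;

public class AllocationRateProbe
{
    public static void Main()
    {
        // Sample the total allocated bytes once per second and report the delta.
        // Roughly 4GB / second, as we saw in production, is a very loud red flag.
        long previous = GC.GetTotalAllocatedBytes();
        while (true)
        {
            Thread.Sleep(1000);
            long current = GC.GetTotalAllocatedBytes();
            Console.WriteLine($"Allocated: {(current - previous) / (1024 * 1024)} MB/sec, " +
                              $"Gen2 collections so far: {GC.CollectionCount(2)}");
            previous = current;
        }
    }
}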

But what was causing all of this angst?

As it turned out, during the previous support incidents, the log level had been changed to Debug. And we output a lot to the log at that level. Now, while that is pretty obvious and easily resolved, we dug deeper and found out exactly what was causing this.

To understand what is going on, you have to remember the situation: a busy server, hundreds of databases, and the log level set to Debug. The logs were written to the same disk as the databases, which happened to be on a remote SAN. Combined with the I/O pressure from all of those databases, that meant that I/O writes to the log could take quite a bit of time.

Now, RavenDB is configured to use async logging exactly for this purpose, so we don’t pay for slow I/O on the logging path. And NLog is a mature project, which has already considered the case of flooding and the potential memory leak that too many pending log messages would cause.

This is implemented in the following method:

public void Enqueue(AsyncLogEventInfo logEventInfo)
{
    lock (this)
    {
        if (this.logEventInfoQueue.Count >= this.RequestLimit)
        {
            InternalLogger.Debug("Async queue is full");
            switch (this.OnOverflow)
            {
                case AsyncTargetWrapperOverflowAction.Discard:
                    InternalLogger.Debug("Discarding one element from queue");
                    // Drop the oldest entry to make room; the new one is still enqueued below
                    this.logEventInfoQueue.Dequeue();
                    break;
                // other options removed for brevity
            }
        }

        this.logEventInfoQueue.Enqueue(logEventInfo);
    }
}

The log is configured to use Discard on overflow, which is why I’m showing only this code.
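For reference, a configuration along these lines is what produces that behavior. This is a hedged sketch using NLog’s programmatic configuration API, with illustrative target names and queue limit rather than the actual production settings:

using NLog;
using NLog.Config;
using NLog.Targets;
using NLog.Targets.Wrappers;

public static class LoggingSetup
{
    public static void Configure()
    {
        var file = new FileTarget
        {
            FileName = "${basedir}/logs/server.log" // illustrative path
        };

        // Wrap the file target so writes happen on a background thread,
        // and discard the oldest entries once the queue is full.
        var asyncFile = new AsyncTargetWrapper(file)
        {
            QueueLimit = 10000, // illustrative limit
            OverflowAction = AsyncTargetWrapperOverflowAction.Discard
        };

        var config = new LoggingConfiguration();
        config.AddTarget("async-file", asyncFile);
        // Debug level for everything is what flooded the queue in this incident.
        config.LoggingRules.Add(new LoggingRule("*", LogLevel.Debug, asyncFile));

        LogManager.Configuration = config;
    }
}

With something like that in place, every Debug level message from every database on the server flows through the Enqueue method shown above.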

I wasn’t aware that NLog just took a lock like that, but that isn’t what’s bothering me. What this code is doing is saying: “Okay, I’ve hit the limit, let us discard the entry that has been sitting in the queue the longest.”

What this effectively did was keep references to the log entries in memory just long enough for them to hit Gen2, and then discard them. In the meantime, any new entry would go into the queue, where it would mature into Gen2, but would likely be discarded as well.
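The pattern is easy to reproduce in isolation. Here is a minimal sketch (my own illustration, not RavenDB or NLog code) of a bounded queue that keeps allocations alive just long enough for the GC to promote them, then throws them away:

using System;
using System.Collections.Generic;

public class MidLifeQueueDemo
{
    private const int QueueLimit = 100_000;
    private static readonly Queue<byte[]> Queue = new Queue<byte[]>();

    public static void Main()
    {
        for (int i = 0; i < 10_000_000; i++)
        {
            if (Queue.Count >= QueueLimit)
            {
                // By the time an entry reaches the front of the queue it has
                // typically survived a couple of collections and been promoted,
                // only to become garbage the moment we drop it here.
                Queue.Dequeue();
            }

            // Stand-in for a log event: a short-lived allocation that the
            // queue keeps alive far longer than it needs to be.
            Queue.Enqueue(new byte[512]);
        }

        Console.WriteLine($"Gen0: {GC.CollectionCount(0)}, " +
                          $"Gen1: {GC.CollectionCount(1)}, " +
                          $"Gen2: {GC.CollectionCount(2)}");
    }
}

Depending on heap configuration, most of those arrays get promoted before they are dequeued, so they end up dying in Gen2, which is the most expensive place for garbage to die.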

There is a word for it, and it goes like this: