Logging yourself to death
One of our clients had opened a support incident. Under certain conditions, RavenDB would issue so many I/O requests that it effectively locked up the I/O system. The trigger was a cold boot on a server hosting many hundreds of databases, with all of them forced to start immediately. We applied a policy to rate limit the number of concurrently loading databases, sent a hotfix, and were done.
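To give a sense of the shape of such a policy, here is a minimal sketch. This is not the actual RavenDB hotfix; the limit of 8 concurrent loads and the LoadDatabaseAsync / OpenDatabaseFromDiskAsync names are made up for illustration:

using System.Threading;
using System.Threading.Tasks;

public static class DatabaseLoader
{
    // Hypothetical cap on how many databases may be loading at the same time,
    // so a cold boot with hundreds of databases doesn't flood the I/O system.
    private static readonly SemaphoreSlim ConcurrentLoads = new SemaphoreSlim(8);

    public static async Task LoadDatabaseAsync(string name)
    {
        await ConcurrentLoads.WaitAsync();
        try
        {
            await OpenDatabaseFromDiskAsync(name); // the actual (I/O heavy) load work
        }
        finally
        {
            ConcurrentLoads.Release();
        }
    }

    private static Task OpenDatabaseFromDiskAsync(string name) => Task.CompletedTask; // stub
}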
Everyone was happy, there were birds singing, babies chasing puppies and long walks on the beach.
Then the customer called back: “ever since we applied the hotfix, everything became slower”. A code review of the changes by two team members who were not involved in the original hotfix produced the same conclusion: the affected code was only on the database load code path, and there was no possible way it could impact requests. Attempts to reproduce this locally were unsuccessful.
Working with the customer, we figured out that the problem was related to high CPU usage, which also ruled out the hotfix, since a problem there would have shown up as slowness caused by a lock, not CPU. Digging further, we discovered that most of the time was spent in GC work. That gave us a good handle to start digging. We discovered that on the production system, RavenDB was allocating roughly 4 GB per second.
That is… a lot of memory to allocate. And the size of the Gen2 heap was really high.
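As an aside, on modern .NET (Core 3.0 and later) you can get a rough in-process read on the allocation rate. This is not how the number above was obtained on the customer's system, just a quick sketch for getting a comparable figure:

using System;
using System.Threading.Tasks;

// Sample the process-wide allocation counter one second apart.
var before = GC.GetTotalAllocatedBytes();
await Task.Delay(TimeSpan.FromSeconds(1));
var after = GC.GetTotalAllocatedBytes();

Console.WriteLine($"Allocated roughly {(after - before) / (1024.0 * 1024 * 1024):F2} GB in the last second");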
But what was causing all of this angst?
As it turned out, during the previous support incidents, the log level had been changed to Debug. And we output a lot to the log. Now, while that is pretty obvious and easily resolved, we dug deeper and found out exactly what was causing this.
To understand what was going on, you have to remember the situation: a busy server, hundreds of databases, and the log level set to Debug. The logs were being written to the same disk as the databases, which happened to be on a remote SAN. Combined with the I/O pressure from all of those databases, that meant that I/O writes to the log could take quite a bit of time.
Now, RavenDB is configured to use async logging exactly for this purpose, so we don't pay for slow I/O in the logs. And NLog is a mature project, which has already considered the case of flooding and the potential memory leak that too many queued log messages can cause.
This is implemented in the following method:
public void Enqueue(AsyncLogEventInfo logEventInfo)
{
    lock (this)
    {
        if (this.logEventInfoQueue.Count >= this.RequestLimit)
        {
            InternalLogger.Debug("Async queue is full");
            switch (this.OnOverflow)
            {
                case AsyncTargetWrapperOverflowAction.Discard:
                    InternalLogger.Debug("Discarding one element from queue");
                    this.logEventInfoQueue.Dequeue();
                    break;

                // other options removed for brevity
            }
        }
        this.logEventInfoQueue.Enqueue(logEventInfo);
    }
}
The log is configured to use Discard on overflow, which is why I’m showing only this code.
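For reference, this is roughly how an async target with Discard-on-overflow gets wired up when configuring NLog programmatically. It is a sketch of standard NLog usage, not RavenDB's actual configuration; the file name and queue limit are made up:

using NLog;
using NLog.Config;
using NLog.Targets;
using NLog.Targets.Wrappers;

var fileTarget = new FileTarget("file") { FileName = "ravendb.log" };
var asyncTarget = new AsyncTargetWrapper(fileTarget)
{
    // Drop entries instead of blocking or growing when the queue is full.
    // QueueLimit corresponds to RequestLimit in the snippet above.
    OverflowAction = AsyncTargetWrapperOverflowAction.Discard,
    QueueLimit = 10000
};

var config = new LoggingConfiguration();
config.LoggingRules.Add(new LoggingRule("*", LogLevel.Debug, asyncTarget));
LogManager.Configuration = config;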
I wasn’t aware that NLog just took a lock like that, but that isn’t what’s bothering me. What this code is saying is: “Okay, I’ve hit the limit, let us discard the stuff that has been in the queue the longest time”.
What this effectively did was keep references to the log entries in memory just long enough for them to be promoted to Gen2, and then discard them. In the meantime, any new entry would go into the queue, where it would also mature into Gen2 and then, most likely, be discarded as well.
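To make the GC angle concrete, here is a minimal, standalone illustration of the mechanism (nothing RavenDB or NLog specific): objects that sit in a queue across a couple of collections get promoted, so discarding them afterwards produces Gen2 garbage that only a full, expensive collection can reclaim.

using System;
using System.Collections.Generic;

var queue = new Queue<byte[]>();
for (var i = 0; i < 10_000; i++)
    queue.Enqueue(new byte[1024]); // stand-ins for queued log entries

GC.Collect(); // entries survive one collection -> promoted to Gen1
GC.Collect(); // and another one -> promoted to Gen2

Console.WriteLine(GC.GetGeneration(queue.Peek())); // typically prints 2

// "Discarding" an old entry now releases a Gen2 object, so its memory is only
// reclaimed by a full collection, while new entries keep arriving and repeat
// the same cycle.
queue.Dequeue();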
There is a word for it, and it goes like this:
Comments
How do you end up "fixing" something like that?
Daniel, Reducing the log level will reduce the load on the logging, while also reducing the I/O load. In the long term, we'll have our own logging infrastructure.
So I guess there is no chance of contributing to the NLog project to improve it instead of creating yet another logging framework?
If you log asynchronously by default, it means that when a sudden shutdown occurs you can lose log entries. How do you deal with that? Doesn't it limit the accuracy of your logs and remove some of your ability to reason about ungraceful shutdowns?
Eric, I'll answer that in a separate post, it is too big for a comment.
Scooletz, Errors like that are almost always a process-ending crash. That gives us errors at the OS level, for example, from the event log.
In one of the systems I've worked on, I added a red background to the management UI title bar when logging and tracing were set to verbose, to warn the users, because it could generate logs in excess of 300 GB per day. Maybe adding visual cues like this when Debug settings are enabled would prevent such mistakes in the future?
Catalin, One of the problems we have is that the logging has a separate configuration, which makes it hard to find out what the current log level is. That makes such a feature a bit hard to do. One of the things we'll look at is this kind of feature, yes.
You are doing some very interesting debugging there!
Btw, did you mean "let us discard the stuff" instead of "let us start the stuff"?
Konstantins, Yes, that was a typo. Fixed