flume: All content tagged as flume in NoSQL databases and polyglot persistence

Friday, 25 January 2013

Apache Flume Performance Tuning

About:

Flume

Databus

A lot of apps get to ship logs and while there are probably numerous tools to help with this, Apache Flume¹ is the one I’d look first (even if for taking inpiration on how to do things):

An important decision to make when designing your Flume flow is what type of channel you want to use. At the time of this writing, the two recommended channels are the file channel and the memory channel. The file channel is a durable channel, as it persists all events that are stored in it to disk. So, even if the Java virtual machine is killed, or the operating system crashes or reboots, events that were not successfully transferred to the next agent in the pipeline will still be there when the Flume agent is restarted. The memory channel is a volatile channel, as it buffers events in memory only: if the Java process dies, any events stored in the memory channel are lost. Naturally, the memory channel also exhibits very low put/take latencies compared to the file channel, even for a batch size of 1. Since the number of events that can be stored is limited by available RAM, its ability to buffer events in the case of temporary downstream failure is quite limited. The file channel, on the other hand, has far superior buffering capability due to utilizing cheap, abundant hard disk space.

Just a couple of extra-thoughts:

Flume NG seems to offer 3 types of channels: file, jdbc, memory.
For the memory channel, I’d be adding an option to start dropping events if the memory consumption goes above a configurable threshold (this might already be implemented, but I couldn’t find it)
Would it be worth investigating a channel based on LinkedIn’s low latency transfer Databus tool?