Balancing Workload Across Nodes with Akka 2

In Akka 2, there is a nifty little thing called the BalancingDispatcher, which will magically distribute work to a collection of Actors in the most efficient way possible (i.e. it’s a work stealing dispatcher). The SmallestMailboxRouter has this kind of feel as well. However, the BalancingDispatcher and the SmallestMailboxRouter differ in how the choice is made about who, and when to deliver the incoming message. The BalancingDispatcher dispatches a message to an Actor only when that Actor would otherwise be idle. The SmallestMailboxRouter dispatches the message to the Actor with the least number of messages in its Mailbox even while the Actor is currently working on other things.

Often times, people want the functionality of the BalancingDispatcher with the stipulation that the Actors doing the work have distinct Mailboxes on remote nodes. In this post we’ll explore the implementation of such a concept.

I’m going to assume you understand Akka to a reasonably competent level. Most people don’t get to the point of asking how to do something like this without understanding a fair bit about Akka in the first place, so I’m going to pretend that my assumption is reasonable. :)

The BalancingDispatcher

First, take a quick look at the BalancingDispatcher. Essentially, it works by sharing a single Mailbox amongst all of its Actors. It doesn’t really “steal” work from its Actors; it just intelligently distributes that work by ensuring that the next message gets sent to an idle Actor, and if one isn’t available, then the message doesn’t get sent until that changes.

Fig 1: The BalancingDispatcher distributes work to Actors evenly by having them all share the same Mailbox.

The SmallestMailboxRouter

The SmallestMailboxRouter differs because the Actors to which it routes all have distinct Mailboxes. So, while the BalancingDispatcher makes its decision on which Actor gets the message at message processing time, the SmallestMailboxRouter makes its decision on which Actor gets the message at message delivery time.

Fig 2: The SmallestMailboxRouter sends messages to Actors immediately, regardless as to how much time they may send working on their messages in the future. The key is that it sends the message to the Actor with the smallest Mailbox.

The Goal

We’re going to rely on two key components: the Master and Worker. The Master will help coordinate the work and the Worker will execute on that work. The qualities of the pattern are as follows:

Implement the fundamental concept of the BalancingDispatcher- that which distributes work only to those that can work on it immediately - while still having distinct Mailboxes per Worker.
- We must have distinct Mailboxes per Worker since the Workers are on remote nodes and if they didn’t have Mailboxes then there would be nowhere to deliver their messages.
Preserve the original sender of the work. This allows the Worker to respond to the original sender when the work is complete.
Provide a mechanism for the Master to detect death of a Worker and reassign the work if need be.
Allow the capacity of the system to increase by the addition of Workers dynamically.

The Pattern Overview

In order to deliver on our goal we are going to invert the work-sending paradigm of the BalancingDispatcher. The Master will not push work to the Workers; instead, the Workers will pull that work from the Master on-demand. There are a number of advantages to inverting this message flow, as we’ll see.

Fig 3: Ultimately this is what we’re going to achieve. The Master distributes work to Workers only when Workers request it. The Workers will inform the Master that they’re finished their work, but the Master will not push work their way until they make another request.

One of the key points to remember about this pattern is that it’s entirely event driven with no polling. Message number 2 might make it look like there is polling, but you can rest assured that it isn’t - message number 2 is sent as the result of a previous event, not a timer.

The Master and its Workers

In order for the pattern to be valid we must have concrete instances of the Master (on some node) and the Workers (on the same node, or other nodes). The relationship between the two is key: Workers identify themselves to the Master - the Master doesn’t spawn Workers. A variant on this pattern would be to have the Master do exactly that, but this definition does not include that notion.

The caveat, therefore, is that the Master must be running before the Workers try to register with it, and the Workers must be aware of the Master’s location. Another variant could use failure / retry handling in the Workers to ease the need for this startup ordering, but that’s a detail we’re not concerned with here.

Fig 4: The Master only learns of the existence of the Workers when they advertise themselves.

This constraint opens up the flexibility of the system:

There’s no clunky configuration of the Master required. We don’t need to “reconfigure it” every time we want to add a new Worker.
You can increase capacity dynamically. If you need more Workers, just instantiate them and they’ll get work assigned to them as soon as it becomes available.

This is essentially how Hadoop’s MapReduce system works, and it’s the same way many systems before it have worked as well.

The Master

Now that we know what the pattern looks like we can describe the operation of the Master. For the most part, our Master will take on the role of the BalancingDispatcher’s single Mailbox. It will collect up work to be done, and distribute it as required amongst the Workers, but only at the moment when those Workers can actually work on it.

Fig 5: The Master sends work to Workers only when they ask for it. Workers are not polling for work - they ask for it when they’re finished doing their current job, or they’ve been told that pending work is available.

Above we see that “Worker 3” is in a position to ask for work, while the others are busy. The Master will give that work out to “Worker 3” and hold on to the rest, waiting until another Worker (or it may actually end up being “Worker 3” again) asks for work.

The Protocol

We start with the message protocol that passes between the Master and its Workers. The following defines both messages sent to and messages sent from the Master.

The Master Code

Next we see the master code. Its job is to queue work sent to it from the outside world, and handle the three messages that can be sent from the Workers. It also puts Deathwatch on the Workers in order to take action should any of those Workers croak.

The Worker

The Worker’s job is to switch between two states: idle and working. When it’s idle, it’s looking for work and when it’s working it’s trying to go idle. The Worker responds to messages differently depending on what state it’s currently in.

We implement the Worker as an abstract base class, from which you would derive and implement the doWork() method. It’s intended that this doWork() method do its job asynchronously but that’s not a requirement. However, given how simple it is to spawn of work in a Future with Akka, why wouldn’t you?

Implementing a Worker

In order to use the Worker, we need to create a derivation. For the purposes of this example, and testing, we’re going to create a Worker that sends a string to the original work requester.

Note how we’ve spawned the work off into its own Future that will execute asynchronously. When it’s complete it will inform the Worker of that fact using the WorkComplete message.

Fig 6: This is really how you should implement your doWork() method. Send the work off to the future, which will allow your Worker to remain responsive to the Master.

Testing

And we now verify that everything works by implementing a test.

Possible Extensions / Modifications

What’s been presented is merely a pattern as instruction on the mechanism for achieving a BalancingDispatcher-style work distribution system across multiple nodes. There are a number of variants you could make of this pattern.

Forget the whole “idle” thing. When a Worker sees that there’s no more work to be done, it can stop itself (i.e. context.stop(self)). Given the current implementation, the Master would detect this and remove it from the list of possible candidates.
Remove the side-effects. Have a tighter relationship between the Master and its Workers; the Worker pulls work from the Master and actually returns results back to the Master. Personally I can’t see a reason to do this.
Have the Master spawn the children as needed. This would create them as direct children, which is certainly different than what we have here but if that’s what you want to do then do it. The pull semantics don’t change, however. You still pull - you just create the Worker as part of the Master logic.
Any retry-logic that you might want would have to live in the Master, but idempotency and retry are way beyond the scope of this article. You can implement retry in the Master and / or dovetail this concept into the Clustering features that are coming up in the 2.x stream of Akka.
If you’re smart, and implement doWork() in the future, then you can have the Worker be responsive to the Master’s requests for status (for example).

Usage Example

Let’s take a different look at how you might use this, just to illustrate that it’s still a very general implementation. If you want to download a ton of pages in parallel across a huge farm of nodes, collect them up, and then dump them to third party (say, a page renderer) when they’re all in… piece of cake.

Conclusion

The key to this pattern isn’t the implementation or anything fancy. What’s important is the inversion of what, for some, is their natural inclination. When people think of an on-demand, event-based system, they think push, but push doesn’t work here. You can’t push to an Actor unless you know it will be available to do your work, and since you can’t peek into its Mailbox, you’re not going to be able to find out.

Alternatively you might think of implementing a model where the Master can ask the Worker what it’s doing and push work it’s way when it says it’s idle. This is a bad idea for a couple of reasons.

What if the Worker says “I’m busy, leave me alone”? When can the Master ask again?
- It’s starting to look like polling, and polling is going to slow things down a lot.
What if the Worker says “I’m idle”, but one millisecond later it becomes busy?
- The Master will send it work to do, and it won’t do that work for (possibly) a very long time.

You can try to work around those issues by setting up some sort of transactional nature between the Master and the Worker - something like, “If you’re not busy, tell me you won’t do anything else until I send you something”. But why? Invert from push to pull, and none of these problems exist. With pull, you could even have a single Worker service many Masters without conflict, should you choose to do so.

The bones of this pattern should give you highly reactive, fast, and efficient use of your Workers across multiple nodes. Implement and alter to suit your needs, and then collect your profit!

About the Author

Derek Wyatt is a Software Developer and Architect at Primal, helping to create Interest Networks using graphs and other cool stuff. He is also the author of an upcoming book on Akka to be published by Artima Publishing. You should go get a copy when it comes out… really. It’ll have pictures! Nothing dirty, though…