MongoDB Replica Sets with Kubernetes
This blog post is now outdated! Please check out the updated version
http://blog.kubernetes.io/2017/01/running-mongodb-on-kubernetes-with-statefulsets.html
If you followed my last two posts (1, 2), you successfully set up and ran a MEAN stack with Docker containers and made it scalable with Kubernetes. We ended up with a service stack that looks like this:
Today, I want to focus on the MongoDB database layer of the stack. I assume you are familiar with core Kubernetes concepts; if not you should check out my previous blog posts.
With Kubernetes, the database system is more robust than your standard single instance MongoDB setup. If the MongoDB instance dies, Kubernetes will automatically spin up a new one.
However, this means we will have a few seconds to minutes of downtime while a new pod spins up. Also, if the data gets corrupted (which is unlikely but possible), there is no backup.
You can solve this problem by running a MongoDB Replica Set. But how do you do it on Kubernetes?
It is easy to scale up the web pods with a simple command:
$ kubectl scale rc web --replicas=5
Scaling the database, however, is much trickier.
Note: If you are interested in MySQL, check out Vitess, an OSS project from YouTube that lets you run an auto-sharded MySQL cluster on Kubernetes.
Why is scaling the database difficult?
If you want to just learn how to do it, skip this section.
Unlike the web pod, the database pod has a gcePersistentDisk attached to it. If you try to scale the replication controller, all the pods will try to mount the same disk, which is not possible. Even if it were possible (using a network drive), it would not be a good thing: data corruption in one instance would destroy the whole set, and I/O speeds would be really slow.
Each MongoDB pod needs its own gcePersistentDisk, which means each one needs its own replication controller. Each pod also needs to find the other members of the replica set and register itself. And all of this needs to be automatic.
Step 0: Cleaning up (Optional)
If you have the replication controllers and services from the previous blog post still running, either make a new cluster or delete them.
Step 1: Creating the Replica Set
Normally, you would follow these docs to create a replica set. This includes logging into each instance, running a few commands, and hard-coding IP addresses, none of which is good in a Kubernetes cluster.
Fortunately, you don’t have to do any of that.
We are going to use a “sidecar” to automatically configure our replica set. As you know, a pod can have more than one container inside it. A sidecar is a container that helps out the “main” container. In this case, our sidecar will automatically discover the other MongoDB instances and configure the replica set on the main MongoDB container.
The replica set will be called:
rs0
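To give you an idea of what the sidecar automates, here is a rough conceptual sketch in Node.js. This is not the actual mongo-k8s-sidecar code, and the role=mongo label selector is just an assumption for illustration; it shows the discovery half of the job, which is listing the MongoDB pods through the Kubernetes API so the replica set configuration can be kept in sync with the running pods.

var https = require('https');
var fs = require('fs');

// Every pod gets a service account token and CA cert mounted at these standard paths.
var token = fs.readFileSync('/var/run/secrets/kubernetes.io/serviceaccount/token', 'utf8');
var ca = fs.readFileSync('/var/run/secrets/kubernetes.io/serviceaccount/ca.crt');

var options = {
  host: 'kubernetes.default.svc',
  path: '/api/v1/namespaces/default/pods?labelSelector=' + encodeURIComponent('role=mongo'),
  headers: { Authorization: 'Bearer ' + token },
  ca: ca
};

https.get(options, function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    var ips = JSON.parse(body).items.map(function (pod) { return pod.status.podIP; });
    // With this list, the sidecar can run replSetInitiate / replSetReconfig
    // against the local mongod so the members always match the running pods.
    console.log('MongoDB pod IPs: ' + ips.join(', '));
  });
});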
First download the code:
$ git clone https://github.com/leportlabs/mongo-k8s-sidecar.git
Now go to the examples directory:
$ cd mongo-k8s-sidecar/example/
There you’ll find a Makefile that will manage the MongoDB replica set.
To find out how many MongoDB replicas are currently in the cluster, run:
$ make count
And you should see the following output:
Current Number of MongoDB Replicas: 0
If you don’t see this, make sure your kubectl tool is working.
Now, let’s add one replica:
$ make add-replica DISK_SIZE=200GB ZONE=us-central1-f
Replace us-central1-f with the zone of your cluster.
This will create a new disk to store your data, then launch a replication controller that will spin up a MongoDB instance and a service that maps to this new instance.
Now, run:
$ make count
And you should see the following:
Current Number of MongoDB Replicas: 1
Sweet! Now create two more instances (just run the same make add-replica command again). Best practices call for an odd number of replicas, and three is the standard number.
Now you have a three-instance replica set up and running!
To delete a replica, simply run:
$ make delete-replica
Always use this Makefile to create and destroy replicas unless you know exactly what you are doing.
Step 2: Connecting to the Replica Set
To connect to the replica set, make sure you set “slaveOk: true” and “replicaSet: 'rs0'”. See the options variable (line 51) in the Gist below for more details.
In the case where we had a single MongoDB instance, we made a Service called ‘mongo’ that mapped to the pod, and used that service in our connection string, e.g.:
var url = 'mongodb://mongo:27017/dbname_?';
If we try the same thing again, it won’t work. The Kubernetes Service will only give us an IP address of a random MongoDB pod, not all of them. According to the MongoDB documentation, the other members of the replica set should automatically be found given one IP address, but unfortunately this does not seem to be the case.
There are two options for fixing this problem.
Option 1 (simple):
By default, the Makefile will create a Service for each MongoDB instance. If we have three instances, just change the host portion of the connection string to have all three services.
For example:
var url = 'mongodb://mongo-1,mongo-2,mongo-3:27017/dbname_?';
This is the easiest way to do it.
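For context, here is a minimal sketch of what that connection code could look like with the Node.js MongoDB driver. The database name (dbname) and the items collection are placeholders, and depending on your driver version you may prefer to pass the replica set name and read preference in an options object instead of the connection string:

var MongoClient = require('mongodb').MongoClient;

// List every per-instance service in the host portion and name the replica set
// in the query string so the driver treats the hosts as members of one set.
// readPreference=secondaryPreferred plays the role of the old slaveOk flag.
var url = 'mongodb://mongo-1,mongo-2,mongo-3:27017/dbname?replicaSet=rs0&readPreference=secondaryPreferred';

MongoClient.connect(url, function (err, db) {
  if (err) throw err;
  db.collection('items').find().toArray(function (err, docs) {
    if (err) throw err;
    console.log(docs);
    db.close();
  });
});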
Option 2 (complicated):
The problem with option one is that you have to create a new service for each MongoDB instance, and every time you add more replicas to your replica set, you need to make a code change! For 90% of use cases this probably won’t be a problem, as you don’t add replicas very often.
This does not feel like the Kubernetes or microservices way of doing things. You should not have to know beforehand how many replicas you are running; that should be abstracted away.
So, instead of using the services, we can ping the Kubernetes API to get the actual pod IP addresses of all the MongoDB instances. To do this, I wrote a simple microservice:
$ git clone https://github.com/thesandlord/kubernetes-pod-ip-finder.git
$ cd kubernetes-pod-ip-finder
To launch the microservice:
$ kubectl create -f service.yaml
$ kubectl create -f controller.yaml
This creates a simple REST endpoint you can query for the pod IP addresses.
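For example, your application could ask this endpoint for the current pod IPs at startup and build the connection string from the response. The sketch below is only illustrative: the service name (pod-ip-finder), the path, and the JSON response format are assumptions, so check the project's README for the actual endpoint.

var http = require('http');
var MongoClient = require('mongodb').MongoClient;

// Hypothetical endpoint exposed by the kubernetes-pod-ip-finder service;
// assume it returns a JSON array of pod IPs such as ["10.0.1.5","10.0.2.7"].
http.get('http://pod-ip-finder/', function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    var hosts = JSON.parse(body).map(function (ip) { return ip + ':27017'; }).join(',');
    var url = 'mongodb://' + hosts + '/dbname?replicaSet=rs0';

    MongoClient.connect(url, function (err, db) {
      if (err) throw err;
      console.log('Connected to the replica set at ' + hosts);
      db.close();
    });
  });
});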
Option one is static and not very flexible, but option two has a performance overhead and requires custom code. You can always start with option one, and move to option two if you need more flexibility.
Here is a sample app that uses this microservice; check out lines 66–72 to see how it gets the IP addresses:
Closing Thoughts
With this, you now have a much more production-ready MongoDB setup than before! In my testing, even shutting down a VM doesn't affect anything; Kubernetes just reschedules the pod and volume to another machine. If the primary goes down, the MongoDB replica set takes about 10 seconds to elect a new primary, but that's the worst downtime you should experience.
Also, most of my sample code only works in Google Cloud Platform. I would love to get PRs adding support for other clouds and on-prem deployments. Thanks cvallance for merging my sidecar PR. Special thanks to Stephen for all your help — I really hope to see a similar blog post about using Flocker instead of gcePersistentDisks to manage your MongoDB volumes so it even works on-prem!
If you have any questions, send me a tweet @sandeepdinesh.