Running TensorFlow (with GPU) on Kubernetes
While GPUs are a staple of deep learning, deploying on GPUs makes everything more complicated, including your Kubernetes cluster. This quick guide will walk through adding basic single-GPU support to Kubernetes.
The guide assumes that Kubernetes is already running on Ubuntu. An LTS release is preferable, with 14.04 the first choice because NVIDIA recommends it for driver hosts. Warning: Ubuntu 14.04 is not well supported by Kubernetes, so feel free to use a different distro. This guide also assumes that the proper GPU drivers and CUDA version have been installed; plenty of other guides cover those topics.
TL;DR: start with nvidia-docker, then whittle away its functionality until just plain docker remains. Then add that functionality to Kubernetes.
Working without nvidia-docker
A common way to run containerized GPU applications is to use nvidia-docker. Here is an example of running TensorFlow with full GPU support inside a container.
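A minimal sketch of such a run, assuming the nvidia-docker 1.0 wrapper is installed. The image tag and the Python one-liner are illustrative choices, not necessarily the ones used when this post was written:

```shell
# Assumption: this image tag is illustrative; use whichever GPU-enabled TensorFlow image you prefer.
IMAGE="gcr.io/tensorflow/tensorflow:latest-gpu"

# Importing TensorFlow is enough to make it try to load the CUDA libraries.
if command -v nvidia-docker >/dev/null 2>&1; then
  nvidia-docker run --rm "$IMAGE" python -c 'import tensorflow'
else
  echo "nvidia-docker is not installed on this host" >&2
fi
```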
Simple! If all goes well the output should look something like this:
Unfortunately, it’s not currently possible to use nvidia-docker directly from Kubernetes. Additionally, Kubernetes does not support the nvidia-docker-plugin, since Kubernetes does not use Docker’s volume mechanism.
The goal is to manually replicate the functionality provided by nvidia-docker (and its plugin). For demonstration, query the nvidia-docker-plugin REST API for the command line arguments it would pass to docker:
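One way to make that query, assuming nvidia-docker-plugin 1.0 is listening on its default address of localhost:3476. The function name here is made up for illustration:

```shell
# Assumption: the plugin runs locally on its default port (3476).
PLUGIN_URL="http://localhost:3476/docker/cli"

# Print the docker flags (volume driver, driver volume, device nodes) reported by the plugin.
plugin_cli_args() {
  curl -s --max-time 5 "$PLUGIN_URL"
}

if command -v curl >/dev/null 2>&1; then
  plugin_cli_args || echo "plugin not reachable at $PLUGIN_URL" >&2
fi
```

The reply is a single line of `--volume-driver`, `--volume` and `--device` flags that can be pasted straight into a docker command.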
These arguments feed into docker, running the same python command:
If all goes well, TensorFlow should find everything correctly and you should see the same output as before.
Finally, remove the dependency on nvidia-docker-plugin by manually specifying the driver path and manually mounting the devices and CUDA volumes.
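A sketch of that step, again assuming the plugin's default address and an illustrative image tag. The helper name is hypothetical:

```shell
# Assumptions: illustrative image tag; plugin on its default port.
IMAGE="gcr.io/tensorflow/tensorflow:latest-gpu"
PLUGIN_URL="http://localhost:3476/docker/cli"

run_with_plugin_args() {
  # $(...) is left unquoted on purpose so that each flag becomes its own argument.
  docker run --rm $(curl -s "$PLUGIN_URL") "$IMAGE" python -c 'import tensorflow'
}

# On a GPU host with the plugin running:
#   run_with_plugin_args
```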
Note that this still uses nvidia-docker’s driver volume for discovery. While Kubernetes cannot call the plugin directly, it can read the same files from the host filesystem.
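The plugin-free invocation can be sketched as a small helper that builds the flags by hand. The volume root below is where nvidia-docker 1.0 keeps its driver volumes by default, with one directory per driver version. Treat the path, the helper name and the device list as assumptions to check against your own host:

```shell
# Assumption: default nvidia-docker 1.0 driver volume location on the host.
NV_VOLUME_ROOT="/var/lib/nvidia-docker/volumes/nvidia_driver"

# Build docker flags for a single GPU.
#   $1 = driver version (name of a directory under NV_VOLUME_ROOT)
#   $2 = GPU index (0 for /dev/nvidia0)
gpu_docker_args() {
  echo "--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia$2 --volume=$NV_VOLUME_ROOT/$1:/usr/local/nvidia:ro"
}

# On a GPU host (picks the first driver version found; image tag is illustrative):
#   version=$(ls "$NV_VOLUME_ROOT" | head -n 1)
#   docker run --rm $(gpu_docker_args "$version" 0) gcr.io/tensorflow/tensorflow:latest-gpu python -c 'import tensorflow'
```

Mounting the driver directory at /usr/local/nvidia matters because CUDA-based images generally look for the driver libraries there.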
Enabling GPU devices
With the knowledge of what Docker needs to run a GPU-enabled container, it is straightforward to add this to Kubernetes. The first step is to enable an experimental flag on all of the GPU nodes. In the Kubelet options (found in /etc/default/kubelet if you use upstart for services), add --experimental-nvidia-gpus=1. This does two things. First, it exposes the node’s GPU resources to the scheduler. Second, when a GPU resource is requested, it adds the appropriate device flags to the docker command. This post describes a little more about what this flag is and why it exists:
http://blog.clarifai.com/how-to-scale-your-gpu-cloud-infrastructure-with-kubernetes
The full GPU proposal, including the existing flag and future steps, can be found here:
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md
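For the Kubelet flag described above, the change to /etc/default/kubelet might look like the fragment below. The variable name KUBELET_OPTS is an assumption; it varies by installer, so match whatever your upstart job actually reads:

```shell
# /etc/default/kubelet (sourced by the upstart job)
# Assumption: the kubelet options live in a variable named KUBELET_OPTS.
KUBELET_OPTS="${KUBELET_OPTS:-} --experimental-nvidia-gpus=1"

# Afterwards, restart the kubelet so the flag takes effect:
#   sudo service kubelet restart
```

Once the Kubelet is back up, `kubectl describe node` on a GPU node should list a GPU resource under its capacity.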
Pod Spec
With the device flags added by the experimental GPU flag, the final step is to add the necessary volumes to the pod spec. A sample pod spec is provided below:
If set up correctly, the output should match the output of the nvidia-docker container run at the beginning:
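A sketch of such a spec, written out from the shell so it can be fed to kubectl. The pod name, image tag, output location and driver version are placeholders rather than values from a real cluster. The GPU is requested through the alpha resource name that goes with the experimental flag, and the driver files come in through a hostPath volume:

```shell
# Assumptions: placeholder driver version and output location; adjust both for your nodes.
NV_DRIVER_VERSION="${NV_DRIVER_VERSION:-367.57}"
POD_FILE="${TMPDIR:-/tmp}/gpu-pod.yaml"

cat > "$POD_FILE" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: tensorflow
    image: gcr.io/tensorflow/tensorflow:latest-gpu
    command: ["python", "-c", "import tensorflow"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-driver
      mountPath: /usr/local/nvidia
      readOnly: true
  volumes:
  - name: nvidia-driver
    hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/$NV_DRIVER_VERSION
EOF

# Then, from a machine with cluster access:
#   kubectl create -f "$POD_FILE"
#   kubectl logs tf-gpu-test
```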
Conclusion
Hopefully this guide helps someone wade through these undocumented features to make use of GPUs in their cluster.
Follow me on Twitter for more posts like these. We also do applied research to solve machine learning challenges.