Juju: Enable GPU mode if GPU hardware detected #43467

tvansteenburgh · 2017-03-21T18:18:07Z

What this PR does / why we need it:

Automatically configures the kubernetes-worker node to use GPU hardware when such hardware is detected.

layer-nvidia-cuda detects the hardware, installs CUDA and the Nvidia
drivers, and sets a state that the k8s-worker can react to.

When a GPU is available, the worker updates its config and restarts
kubelet to enable gpu mode. The worker then notifies the master that it
is in gpu mode via the kube-control relation.

When the master sees that a worker is in gpu mode, it switches to
privileged mode and restarts kube-apiserver.

The kube-control interface has subsumed the kube-dns interface
functionality.

An 'allow-privileged' config option has been added to both the worker
and master charms. GPU enablement respects the value of this option;
i.e., gpu mode cannot be enabled if the operator has set
allow-privileged="false".
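
The gating described above can be sketched in a few lines of Python. This is an illustration only, not the charm code: the function names `worker_gpu_enabled` and `master_privileged` are hypothetical, and the sketch assumes the option arrives as a string where "false" forbids privileged mode, "true" forces it, and any other value (shown here as "auto", also an assumption) lets the charm decide.

```python
def worker_gpu_enabled(allow_privileged, gpu_detected):
    """Return True if the worker may switch kubelet into gpu mode."""
    # An explicit "false" from the operator always wins over detection.
    if allow_privileged.strip().lower() == "false":
        return False
    return bool(gpu_detected)


def master_privileged(allow_privileged, worker_gpu_flags):
    """Return True if the master should run kube-apiserver in privileged mode.

    worker_gpu_flags is an iterable of booleans, one per related worker,
    standing in for what workers would report over the kube-control relation.
    """
    value = allow_privileged.strip().lower()
    if value == "false":
        return False
    if value == "true":  # assumed meaning: operator forces privileged mode
        return True
    # Otherwise privileged mode follows the workers: any gpu worker enables it.
    return any(worker_gpu_flags)
```

With these assumptions, a single gpu worker is enough to move the master to privileged mode unless the operator has explicitly set the option to "false".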

Special notes for your reviewer:

Quickest test setup is as follows:

# Bootstrap. If your aws account doesn't have a default vpc, you'll need to
# specify one at bootstrap time so that juju can provision a p2.xlarge.
# Otherwise you can leave out the --config "vpc-id=vpc-xxxxxxxx" bit.
juju bootstrap --config "vpc-id=vpc-xxxxxxxx" --constraints "cores=4 mem=16G root-disk=64G" aws/us-east-1 k8s

# Deploy the bundle containing master and worker charms built from
# https://github.com/tvansteenburgh/kubernetes/tree/gpu-support/cluster/juju/layers
juju deploy cs:~tvansteenburgh/bundle/kubernetes-gpu-support-3

# Setup kubectl locally
mkdir -p ~/.kube
juju scp kubernetes-master/0:config ~/.kube/config
juju scp kubernetes-master/0:kubectl ./kubectl

# Download a gpu-dependent job spec
wget -O /tmp/nvidia-smi.yaml https://raw.githubusercontent.com/madeden/blogposts/master/k8s-gpu-cloud/src/nvidia-smi.yaml

# Create the job
kubectl create -f /tmp/nvidia-smi.yaml

# You should see a new nvidia-smi-xxxxx pod created
kubectl get pods

# Wait a bit for the job to run, then view logs; you should see the
# nvidia-smi table output
kubectl logs $(kubectl get pods -l name=nvidia-smi -o=name -a)
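
The "wait a bit" step can be made explicit with a small polling helper. The sketch below is generic Python and assumes nothing about kubectl output; `wait_until` and its parameters are hypothetical names introduced for this example, and the caller supplies the actual check.

```python
import time


def wait_until(check, timeout=120.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or timeout seconds pass.

    Returns the truthy value from check(); raises TimeoutError otherwise.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(
                "condition not met within %.0f seconds" % timeout)
        sleep(interval)
```

In a test script, `check` could wrap a `subprocess` call to the `kubectl get pods` command shown above and return True once the job's pod has finished.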

kube-control interface: https://github.com/juju-solutions/interface-kube-control
nvidia-cuda layer: https://github.com/juju-solutions/layer-nvidia-cuda
(Both are registered on http://interfaces.juju.solutions/)

Release note:

Juju: Enable GPU mode if GPU hardware detected

k8s-ci-robot · 2017-03-21T18:18:14Z

Hi @tvansteenburgh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

marcoceppi · 2017-03-21T18:26:09Z

@k8s-bot ok to test

marcoceppi · 2017-03-21T18:33:25Z

cluster/juju/layers/kubernetes-master/lib/charms/kubernetes/common.py

@@ -0,0 +1,48 @@
+import re


Make sure we have the right headers in place for these new files

tvansteenburgh · 2017-03-21

Ahh, right. Good catch, will fix.

marcoceppi · 2017-03-21T19:00:49Z

/approve

lazypower · 2017-03-23T15:32:40Z

@k8s-bot cvm gce e2e test this

cmluciano · 2017-03-24T21:00:09Z

cluster/juju/layers/kubernetes-worker/reactive/kubernetes_worker.py

+    # Not sure why this is necessary, but if you don't run this, k8s will
+    # think that the node has 0 gpus (as shown by the output of
+    # `kubectl get nodes -o yaml`
+    check_call(['nvidia-smi'])


Can you open a bug for this and cc @cmluciano?
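
Until the root cause is tracked down, one option is to run the probe defensively so that a missing or failing binary does not raise out of the handler. This is a sketch under that assumption, not the code in this PR: `probe_gpu` and its `cmd` parameter are hypothetical names.

```python
import subprocess


def probe_gpu(cmd=("nvidia-smi",)):
    """Run the probe command; return True only if it exits with status 0."""
    try:
        subprocess.check_call(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing or non-executable binary;
        # CalledProcessError covers a non-zero exit status.
        return False
    return True
```

Whether swallowing the failure is appropriate depends on how the charm should report a broken driver install; raising, as the diff does, makes the failure visible immediately.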

cmluciano · 2017-03-24T21:04:52Z

cluster/juju/layers/kubernetes-worker/reactive/kubernetes_worker.py

+    remove_state('kubernetes-worker.kubelet.restart')
+
+
+@when('cuda.installed')


Is this the only check for detecting if GPUs are exposed? I may be missing logic somewhere else.

tvansteenburgh · 2017-03-24

Yeah, the actual logic for detection is in a separate layer that this charm uses (https://github.com/juju-solutions/layer-nvidia-cuda). If GPUs are detected, the nvidia-cuda layer sets the 'cuda.installed' state so other layers can react to it.

lazypower · 2017-03-30T12:02:10Z

+1 LGTM

/approve

lazypower · 2017-04-04T16:54:37Z

/lgtm

k8s-github-robot · 2017-04-04T16:55:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chuckbutler, marcoceppi, tvansteenburgh

Needs approval from an approver in each of these OWNERS Files:

~~cluster/juju/OWNERS~~ [chuckbutler,marcoceppi]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-github-robot · 2017-04-04T21:33:25Z

Automatic merge from submit-queue (batch tested with PRs 44047, 43514, 44037, 43467)

k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Mar 21, 2017

k8s-github-robot assigned lazypower Mar 21, 2017

k8s-github-robot added the needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD) and size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) labels Mar 21, 2017

k8s-github-robot added the release-note-label-needed label Mar 21, 2017

tvansteenburgh force-pushed the gpu-support branch from 9bb0e37 to 35b6f1b Compare March 21, 2017 18:22

k8s-github-robot removed the needs-rebase label Mar 21, 2017

k8s-github-robot added the release-note label (denotes a PR that will be considered when it comes time to generate release notes) and removed the release-note-label-needed label Mar 21, 2017

marcoceppi reviewed Mar 21, 2017

k8s-github-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 21, 2017

tvansteenburgh force-pushed the gpu-support branch from 78a6951 to c87ac5e Compare March 23, 2017 16:01

cmluciano reviewed Mar 24, 2017

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Apr 4, 2017

k8s-github-robot merged commit 3a3dc82 into kubernetes:master Apr 4, 2017
