Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Juju: Enable GPU mode if GPU hardware detected #43467

Merged
merged 1 commit into from
Apr 4, 2017

Conversation

tvansteenburgh
Copy link
Contributor

@tvansteenburgh tvansteenburgh commented Mar 21, 2017

What this PR does / why we need it:

Automatically configures kubernetes-worker node to utilize GPU hardware when such hardware is detected.

layer-nvidia-cuda does the hardware detection, installs CUDA and Nvidia
drivers, and sets a state that the k8s-worker can react to.

When gpu is available, worker updates config and restarts kubelet to
enable gpu mode. Worker then notifies master that it's in gpu mode via
the kube-control relation.

When master sees that a worker is in gpu mode, it updates to privileged
mode and restarts kube-apiserver.

The kube-control interface has subsumed the kube-dns interface
functionality.

An 'allow-privileged' config option has been added to both worker and
master charms. The gpu enablement respects the value of this option;
i.e., we can't enable gpu mode if the operator has set
allow-privileged="false".

Special notes for your reviewer:

Quickest test setup is as follows:

# Bootstrap. If your aws account doesn't have a default vpc, you'll need to
# specify one at bootstrap time so that juju can provision a p2.xlarge.
# Otherwise you can leave out the --config "vpc-id=vpc-xxxxxxxx" bit.
juju bootstrap --config "vpc-id=vpc-xxxxxxxx" --constraints "cores=4 mem=16G root-disk=64G" aws/us-east-1 k8s

# Deploy the bundle containing master and worker charms built from
# https://github.com/tvansteenburgh/kubernetes/tree/gpu-support/cluster/juju/layers
juju deploy cs:~tvansteenburgh/bundle/kubernetes-gpu-support-3

# Setup kubectl locally
mkdir -p ~/.kube
juju scp kubernetes-master/0:config ~/.kube/config
juju scp kubernetes-master/0:kubectl ./kubectl

# Download a gpu-dependent job spec
wget -O /tmp/nvidia-smi.yaml https://raw.githubusercontent.com/madeden/blogposts/master/k8s-gpu-cloud/src/nvidia-smi.yaml

# Create the job
kubectl create -f /tmp/nvidia-smi.yaml

# You should see a new nvidia-smi-xxxxx pod created
kubectl get pods

# Wait a bit for the job to run, then view logs; you should see the
# nvidia-smi table output
kubectl logs $(kubectl get pods -l name=nvidia-smi -o=name -a)

kube-control interface: https://github.com/juju-solutions/interface-kube-control
nvidia-cuda layer: https://github.com/juju-solutions/layer-nvidia-cuda
(Both are registered on http://interfaces.juju.solutions/)

Release note:

Juju: Enable GPU mode if GPU hardware detected

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 21, 2017
@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 21, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@k8s-ci-robot
Copy link
Contributor

Hi @tvansteenburgh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 21, 2017
@marcoceppi
Copy link
Contributor

@k8s-bot ok to test

@k8s-github-robot k8s-github-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Mar 21, 2017
@@ -0,0 +1,48 @@
import re
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure we have the right headers in place for these new files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, right. Good catch, will fix.

@marcoceppi
Copy link
Contributor

/approve

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 21, 2017
@lazypower
Copy link
Contributor

@k8s-bot cvm gce e2e test this

layer-nvidia-cuda does the hardware detection and sets a state that the
worker can react to.

When gpu is available, worker updates config and restarts kubelet to
enable gpu mode. Worker then notifies master that it's in gpu mode via
the kube-control relation.

When master sees that a worker is in gpu mode, it updates to privileged
mode and restarts kube-apiserver.

The kube-control interface has subsumed the kube-dns interface
functionality.

An 'allow-privileged' config option has been added to both worker and
master charms. The gpu enablement respects the value of this option;
i.e., we can't enable gpu mode if the operator has set
allow-privileged="false".
# Not sure why this is necessary, but if you don't run this, k8s will
# think that the node has 0 gpus (as shown by the output of
# `kubectl get nodes -o yaml`
check_call(['nvidia-smi'])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you open a bug for this and cc @cmluciano

remove_state('kubernetes-worker.kubelet.restart')


@when('cuda.installed')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only check for detecting if GPUs are exposed? I may be missing logic somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the actual logic for detection is in a separate layer that this charm uses (https://github.com/juju-solutions/layer-nvidia-cuda). If GPUs are detected, the nvidia-cuda layer sets the 'cuda.installed' state so other layers can react to it.

@lazypower
Copy link
Contributor

+1 LGTM

/approve

@lazypower
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 4, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chuckbutler, marcoceppi, tvansteenburgh

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 44047, 43514, 44037, 43467)

@k8s-github-robot k8s-github-robot merged commit 3a3dc82 into kubernetes:master Apr 4, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants