The Cluster Health Scanner tool, or simply CHS, runs a series of tests called health checks to analyze the health of a cluster of GPU nodes.
For instructions on how to run CHS, go directly to the 'Running CHS' section.
While currently structured for Google Kubernetes Engine (GKE), CHS can theoretically run on clusters using other Kubernetes orchestration implementations. There are plans to enable CHS to run on additional cluster orchestrators, such as Slurm for HPC.
CHS is built around a 'Health Runner' that controls the scheduling, launching, and post-processing of the individual health checks that run on the nodes in the cluster.
This repository is structured to separate running/deploying CHS and building CHS.
The `deploy/` directory contains the code to deploy CHS on a cluster. This is currently done for GKE clusters using Helm. If you're looking to quickly start using CHS on your cluster, you only need to install the Health Runner Helm chart, which contains all the health checks included in CHS. See the section below for details on running CHS.
The `src/` directory contains the code that runs as part of the Health Runner & health checks. This is not necessary to run CHS; instead, the code in `src/` is provided for those wanting to build CHS directly. See the section below for more details.
The `docker/` directory contains the Dockerfiles that can be used with `src/` to build your own Docker images for CHS. This is not necessary to run CHS. However, certain users might find it useful for customizing their version of CHS.
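Putting this together, the top-level layout (showing only the directories discussed above) looks roughly like this:

```text
.
├── deploy/   # Helm charts & deployment code for running CHS on a cluster
├── docker/   # Dockerfiles for building CHS images
└── src/      # Source code for the Health Runner & health checks
```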
CHS is broken up into the following parts:
- Health Runner
- Manages how the health checks are launched on the nodes and then later cleaned up after completion
- Health Checks
- Kicked off by the Health Runner and run on the nodes in the cluster; their reported results reflect the health of the node(s)/cluster
A user can configure the Health Runner to run particular health checks. This configuration includes settings for how the Health Runner itself runs as well as settings for the health checks it launches on nodes.
The Health Runner coordinates which health checks are launched, how, and on which nodes.
The user configures the Health Runner to specify:
- Which health checks to run
- Settings/configuration of those health checks
- Other aspects of how the Health Runner launches health checks, such as how many health checks run in parallel, which Docker image to use for each health check, and how often to launch health checks on the appropriate nodes
Health checks can run on single or multiple nodes. As health checks complete, results are reported by node labels.
These health checks are all launched by the Health Runner & are set by the Health Runner's configuration.
The NCCL health check runs on two nodes to check networking across those nodes. A 'pass' is given to both nodes when their average bandwidth meets or exceeds a given threshold.
The NCCL health check pairs nodes based on the `PAIRING_MODE` parameter in the Health Runner configuration. By default, the NCCL health check runs in random pairing mode, which runs the health check between random pairs of nodes. The supported pairing modes are:
- Random Pairs (default) - `PAIRING_MODE: random`
  - Ignores the topology of the cluster and randomly pairs nodes together.
  - When second pass is enabled (`SECOND_PASS_ENABLED: true`), an NCCL test failure will proceed with another random pairing, this time using one passed node and one failed node in each pair.
- Intra Rack - `PAIRING_MODE: intra_rack`
  - Checks intra-rack communication by running NCCL tests between nodes in the same rack.
  - When second pass is enabled (`SECOND_PASS_ENABLED: true`), an NCCL test failure will proceed with another pairing, this time using one passed node and one failed node in each pair (within the rack).
- Inter Rack - `PAIRING_MODE: inter_rack`
  - Checks inter-rack communication by running NCCL tests between nodes in different racks (the racks will be in the same cluster).
  - When second pass is enabled (`SECOND_PASS_ENABLED: true`), a rack that failed the initial NCCL test will be randomly paired with a different rack that previously passed. Note that nodes will still be labeled as pass/fail, but only failed nodes in a failed rack will be tested again if second pass is enabled.
- Inter Cluster - `PAIRING_MODE: inter_cluster`
  - Checks inter-cluster communication by running NCCL tests between nodes in different clusters.
  - When second pass is enabled (`SECOND_PASS_ENABLED: true`), a cluster that failed the initial NCCL test will be randomly paired with a different cluster that previously passed. Note that nodes will still be labeled as pass/fail, but only failed nodes in a failed cluster will be tested again if second pass is enabled.
Note that, for all pairing modes, second pass is enabled by default; it reruns the NCCL test on pairs consisting of one healthy and one unhealthy node, which reduces the number of falsely tainted nodes.
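For example, a sketch of switching the NCCL health check to intra-rack pairing at install time (the release name is illustrative; `--set-string` keeps the value a string, matching the defaults shown later, and second pass stays enabled by default):

```bash
helm install "my-hr-release" \
  deploy/helm/health_runner \
  --set-string health_checks.nccl_healthcheck.env.PAIRING_MODE="intra_rack"
```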
The GPU health check runs NVIDIA's DCGM diagnostic tool to report a single node's health. A 'pass' is given when no errors appear while running the tool.
The neper health check runs the neper Linux networking performance tool to report the health of a node. A 'pass' is given when the bandwidth across connections in the node meets or exceeds a given threshold.
Running CHS only requires installing the Health Runner on the cluster.
The Health Runner will be able to launch health checks on the cluster based on the user's installation configuration.
Note: Currently this is done on GKE/Kubernetes using Helm charts. The description below focuses on running CHS using Helm charts on GKE.
Nodes to be included in a health check are marked using a corresponding node label.
The node label keys depend on the health check, with an expected value of `"true"`:
- NCCL Health Check: `aiinfra/nccl-healthcheck-test`
- GPU Health Check: `aiinfra/gpu-healthcheck-test`
- Neper Health Check: `aiinfra/neper-healthcheck-test`
These label keys & values can be set with the `kubectl` tool using the following command:
kubectl label nodes \
--all \
aiinfra/nccl-healthcheck-test="true"
Note: This sets all nodes to be labeled for the NCCL health check.
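Similarly, you can label only a subset of nodes and use the label key for a different health check. A sketch using the GPU health check label key from the list above and two hypothetical node names:

```bash
kubectl label nodes \
  gke-example-node-0 gke-example-node-1 \
  aiinfra/gpu-healthcheck-test="true"
```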
The user can configure the Health Runner via the command line or as part of a YAML configuration file. This configuration also gives the settings for the health checks to be run.
Refer to the 'Default Configuration' section for an example of a full configuration file.
The following are the Health Runner configuration options:
`health_runner.name` will be used as the name of the Kubernetes Job for the Health Runner.
health_runner:
name: "health-runner"
Each health check is listed under the `health_checks` section. The configuration is specific to each health check, though there are certain settings that apply to all health checks.
Note that in the section below we use the placeholder `HC_NAME`, which would be replaced with the identifying name of a health check, such as `nccl_healthcheck`.
`health_checks.HC_NAME.run_check`: this is either `true` or `false` and specifies whether the health check should be run.
`health_checks.HC_NAME.runner_name`: gives the name of the Job used to run the health check. The value of `runner_name` will be used as the base of the name of the Kubernetes Job for each health check instance launched.
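For example, mirroring the structure of the default configuration shown later (the `runner_name` value is illustrative):

```yaml
health_checks:
  HC_NAME:
    run_check: true
    runner_name: hc-name-health-runner
    ...
```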
The `image` section specifies information regarding the Docker image for the health check.
- `health_checks.HC_NAME.image.repo`: the base repo URL for the Docker image for the health check.
- `health_checks.HC_NAME.image.tag`: the image tag for the Docker image for the health check.
- `health_checks.HC_NAME.image.pull_policy`: the pull policy for the Docker image for the health check.
Example:
health_checks:
HC_NAME:
...
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "subset"
pull_policy: "Always"
...
The `blast_mode` section of the configuration gives settings for running health checks in parallel.
- `health_checks.HC_NAME.blast_mode.blast_mode_enabled`: set to `"true"` or `"false"`. If set to `"false"`, a failed health check will taint the corresponding node(s).
- `health_checks.HC_NAME.blast_mode.BLAST_MODE_NUM_TESTS_LIMIT`: set to an integer specifying how many health checks can be launched simultaneously across the cluster.
- `health_checks.HC_NAME.blast_mode.NODES_CHECKED_PER_TEST`: set to an integer specifying how many nodes are run for each test. The NCCL & neper health checks use 2 nodes, while the GPU health check only uses 1.
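For instance, a sketch of a `blast_mode` block for a two-node health check, following the nesting used in the default configuration shown later (the test limit value is illustrative):

```yaml
health_checks:
  HC_NAME:
    ...
    blast_mode:
      blast_mode_enabled: true
      env:
        BLAST_MODE_NUM_TESTS_LIMIT: "10"
        NODES_CHECKED_PER_TEST: "2"
    ...
```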
The `env` section of the configuration is specific to each health check and is used to modify the settings of the health check(s) to be kicked off by the Health Runner. Some settings are specific to the health check type, but others are universal to all health checks.
- `health_checks.HC_NAME.env.DRY_RUN`: set to `"true"` or `"false"`. If set to `"false"`, a health check that fails on a node or nodes will taint the respective node(s).
- `health_checks.HC_NAME.env.SLEEP_TIME_MINUTES`: set to an integer value that acts as a timeout for the health check, specifying the maximum time allowed for completion. If a health check exceeds this time, it is canceled and the test result is not updated.
- `health_checks.HC_NAME.env.YAML_FILE`: specifies the YAML file used by the Health Runner to launch the health check. This YAML file must be present in the Health Runner container (via the Docker image).
For the NCCL health check:
- `health_checks.HC_NAME.env.YAML_FILE`: must be set to either `"a3plus/nccl_healthcheck.yaml"` or `"a3/nccl_healthcheck.yaml"`, depending on the nodes' accelerator type.

For the GPU health check:
- `health_checks.HC_NAME.env.YAML_FILE`: must be set to `"gpu_healthcheck.yaml"`.
- `health_checks.HC_NAME.env.R_LEVEL`: set to `1`, `2`, `3`, or `4`, defining what level of diagnostics to run. Lower numbers indicate faster but more basic diagnostics. It is recommended to set this to `2` or `3`, with `3` being a longer, more extensive diagnostic check.

For the neper health check:
- `health_checks.HC_NAME.env.YAML_FILE`: must be set to `"neper_healthcheck.yaml"`.
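Putting these together, a sketch of an `env` block for the GPU health check, using the settings described above (values illustrative, and note the default configuration below uses Helm-chart-based launch settings rather than `YAML_FILE`):

```yaml
health_checks:
  gpu_healthcheck:
    ...
    env:
      DRY_RUN: "true"
      SLEEP_TIME_MINUTES: "30"
      YAML_FILE: "gpu_healthcheck.yaml"
      R_LEVEL: "2"
    ...
```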
The default configuration is set so that the Health Runner will run only the NCCL health check every 5 minutes (10 health checks at a time) for A3+ GPU nodes.
The default configuration for the Health Runner (found in the Helm chart's `values.yaml` file) is shown below:
health_runner:
name: "health-runner"
health_checks:
nccl_healthcheck:
run_check: true
runner_name: nccl-health-runner-a3plus
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "v4.1.0"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "v4.1.0"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
HELM_CHART: "/app/health_checks/nccl_healthcheck"
HELM_INSTALL_FLAGS: "-f /app/health_checks/nccl_healthcheck/a3plus.yaml --set health_check.image.tag=$HC_IMAGE_TAG"
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
HEALTH_APP: "nccl"
PAIRING_MODE: "random"
SECOND_PASS_ENABLED: "true"
blast_mode:
blast_mode_enabled: true
env:
NODES_CHECKED_PER_TEST: "2"
gpu_healthcheck:
run_check: false
runner_name: gpu-health-runner
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "v4.1.0"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "v4.1.0"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
HELM_CHART: "/app/health_checks/gpu_healthcheck"
HELM_INSTALL_FLAGS: "--set health_check.image.tag=$HC_IMAGE_TAG"
R_LEVEL: "2"
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
blast_mode:
blast_mode_enabled: true
env:
NODES_CHECKED_PER_TEST: "1"
neper_healthcheck:
run_check: false
runner_name: neper-health-runner
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "v4.1.0"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "v4.1.0"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
HELM_CHART: "/app/health_checks/neper_healthcheck"
HELM_INSTALL_FLAGS: "--set health_check.image.tag=$HC_IMAGE_TAG"
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
blast_mode:
blast_mode_enabled: true
env:
NODES_CHECKED_PER_TEST: "2"
Running CHS involves installing the Health Runner. On a Kubernetes-orchestrated cluster, this is done by deploying the Health Runner Helm chart.
The Health Runner Helm chart can be used to install a release using the `helm` command shown below:
MY_HEALTH_RUNNER_RELEASE_NAME="my-hr-release"
helm install "${MY_HEALTH_RUNNER_RELEASE_NAME}" \
deploy/helm/health_runner
This installs the Health Runner with the default configuration, which will automatically kick off health checks on the nodes in the cluster.
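To confirm the release was created and to see health check Jobs as they are launched, you can use the following commands (both are covered again in the sections on useful commands below):

```bash
# Confirm the Health Runner release is installed
helm ls
# List health check Jobs launched by the Health Runner
kubectl get jobs | grep healthcheck
```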
You can also specify your own configuration using your own value files:
MY_HEALTH_RUNNER_RELEASE_NAME="my-hr-release-custom-config"
MY_CONFIG="./my-config.yaml"
helm install "${MY_HEALTH_RUNNER_RELEASE_NAME}" \
deploy/helm/health_runner \
-f "${MY_CONFIG}"
You can also set specific configuration values on the command line using the `--set` parameter of `helm install`.
For example, the following command launches only the GPU health check on the nodes, using `R_LEVEL: "1"` instead of the default value.
MY_HEALTH_RUNNER_RELEASE_NAME="my-hr-release-gpu-only"
helm install "${MY_HEALTH_RUNNER_RELEASE_NAME}" \
deploy/helm/health_runner \
--set health_checks.nccl_healthcheck.run_check=false \
--set health_checks.gpu_healthcheck.run_check=true \
--set health_checks.gpu_healthcheck.env.R_LEVEL="1"
As the Health Runner launches health checks on nodes and they complete, users can view the health check results.
Health check results are stored as node labels and can be viewed using the Kubernetes `kubectl` tool.
The following command displays results for the NCCL health check for each node:
CUSTOM_COLS="NODE:.metadata.name,MARK:.metadata.labels.aiinfra/nccl-healthcheck-test,BANDWIDTH:.metadata.labels.aiinfra/nccl-healthcheck-bandwidth,RESULT:.metadata.labels.aiinfra/nccl-healthcheck-result,RUNTIME:.metadata.labels.aiinfra/nccl-healthcheck-runtime-sec"
kubectl get nodes -o custom-columns="${CUSTOM_COLS}"
This outputs a table with columns showing each node's name and the values of its health check labels.
If the `watch` command is installed, you can create a dynamic display for live updates:
watch -n 10 -d "kubectl get nodes -o custom-columns=${CUSTOM_COLS}"
`watch` reruns the table display command every 10 seconds, highlighting any changes.
After deploying and running CHS, users may wish to clean up the installation.
To uninstall the Health Runner (a Helm release), use the release name (`RELEASE_NAME`) in the following command:
helm uninstall RELEASE_NAME
While the Health Runner Helm chart simplifies cleanup, it's important to remove any lingering Jobs in the cluster that are not removed automatically.
You can list these with a command like the following:
kubectl get jobs | grep healthcheck
To remove lingering Jobs:
kubectl delete jobs $JOB_NAME_0 $JOB_NAME_1
Because Jobs from CHS tend to have similar names, you can filter those Jobs by name (such as `healthcheck` in this example) with something like the following:
# Gets list of jobs, filters for `healthcheck`, selects only the Job name
kubectl get jobs \
| grep healthcheck \
| cut -d ' ' -f1
After confirming the jobs listed are the ones to delete, you can use the command below to delete those jobs:
kubectl get jobs --no-headers \
| grep healthcheck \
| cut -d ' ' -f1 \
| xargs kubectl delete jobs
CHS already has available Docker images that can be used via installing the default Health Runner Helm chart. However, some users may find it useful to create and use the Docker images for CHS themselves.
This could be to add custom functionality, to extend CHS for their use case, or simply to create their own images that they can pull from their own registry.
Whatever the case, this can be done using the Dockerfiles found in the `docker/` folder. There you will find a Dockerfile for the Health Runner and one for each health check.
You can build your own Docker images from those Dockerfiles with the `docker build` command:
docker build \
-f docker/health_runner.Dockerfile .
This builds the image for the Health Runner and can be done similarly for each health check. The build uses the code found in the `src/` directory.
After the images have been built and pushed to a registry, the Health Runner Helm chart can be installed just as shown in the 'Running CHS' section, but specifying the image repo and tag for the health check in the configuration.
The following is an example of how a user could build a Docker image, push that image to their registry, and then run Health Runner with that newly created image using the Helm chart we saw before:
REGISTRY_NAME="my-registry-example.com"
USERNAME="my-user-name"
IMAGE_NAME="my-health-runner"
IMAGE_TAG="v1.0"
FULL_REPO_NAME="${REGISTRY_NAME}/${USERNAME}/${IMAGE_NAME}"
docker build \
-t "${FULL_REPO_NAME}:${IMAGE_TAG}" \
-f docker/health_runner.Dockerfile .
docker push "${FULL_REPO_NAME}:${IMAGE_TAG}"
MY_HEALTH_RUNNER_RELEASE_NAME="my-custom-hr"
helm install "${MY_HEALTH_RUNNER_RELEASE_NAME}" \
deploy/helm/health_runner \
--set health_checks.nccl_healthcheck.run_check=false \
--set health_checks.gpu_healthcheck.run_check=true \
--set health_checks.gpu_healthcheck.image.repo="${FULL_REPO_NAME}" \
--set health_checks.gpu_healthcheck.image.tag="${IMAGE_TAG}"
The following sections show some potentially useful commands and processes when working with CHS. These sections are organized based on the CHS stage you are focusing on.
CHS uses Helm as the main method of deployment for a cluster orchestrated with Kubernetes.
Helm's installation documentation provides detailed instructions. However, most users can simply use the following commands to download and run the installer script:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
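You can then verify that `helm` is available on your PATH:

```bash
helm version
```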
If CHS was deployed previously, you might find you need to do some clean up. Below are some commands that can help in identifying resources/installations to clean up.
To list current Helm Release installations:
helm ls
To list current Helm Release installations more thoroughly, you can include the `-a` flag:
To view current Jobs in Kubernetes cluster:
kubectl get jobs
NOTE: Since the list of Jobs can be quite large, you may find it useful to filter the output of the above command by piping the results to `grep`. For example, the following will get the list of Jobs whose names contain the phrase 'healthcheck':
It might be desirable or necessary to remove some node labels that were part of previous CHS runs/deployments.
First, it's helpful to identify the labels on nodes.
You can inspect a particular node `$NODE_NAME` in a few ways.
This command will give details of the node including the labels near the beginning of the output:
kubectl describe nodes $NODE_NAME
This command will print information about the node in a table format and include a list of labels:
kubectl get nodes $NODE_NAME --show-labels=true
You can also run either of these commands on multiple nodes.
This can be done for a subset of nodes by listing them:
kubectl get nodes $NODE_NAME_0 $NODE_NAME_1 $NODE_NAME_2 --show-labels=true
kubectl describe nodes $NODE_NAME_0 $NODE_NAME_1 $NODE_NAME_2
You can also run the same command for all the nodes by not listing any specific nodes:
kubectl get nodes --show-labels=true
kubectl describe nodes
However, this can be overwhelming and time-consuming, depending on the number of nodes in your cluster.
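One way to narrow the output, assuming you only care about CHS-related labels (which use the `aiinfra/` prefix shown earlier), is to filter with `grep`:

```bash
# Show only the aiinfra/* label lines for a given node
kubectl describe nodes $NODE_NAME | grep aiinfra
```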
To remove labels from nodes, use the following command on specified nodes, where `LABEL_NAME` is the name of the label (note the trailing `-`):
kubectl label nodes $NODE_NAME_0 $NODE_NAME_1 LABEL_NAME-
This command can also be used to remove multiple labels:
kubectl label nodes $NODE_NAME_0 $NODE_NAME_1 LABEL_NAME- LABEL_NAME_2-
You may also want to remove labels from all nodes in the cluster:
kubectl label nodes --all LABEL_NAME-
You can filter the command to apply only to certain nodes based on another label (`FILTER_LABEL`):
# Any nodes with FILTER_LABEL (any value)
kubectl label nodes -l FILTER_LABEL LABEL_NAME-
# Any nodes with the value 'VALUE' for the label FILTER_LABEL
kubectl label nodes -l FILTER_LABEL=VALUE LABEL_NAME-
Health checks require you to label the nodes before running the health check.
The command below labels a specific set of nodes with the label `LABEL_NAME` and value `VALUE`:
kubectl label nodes $NODE_NAME_0 $NODE_NAME_1 LABEL_NAME=VALUE
You can also apply the label to all nodes, or only to nodes with a specific different label `FILTER_LABEL` & value `FILTER_VALUE`:
# Applies to all nodes in cluster
kubectl label nodes --all LABEL_NAME=VALUE
# Any nodes with FILTER_LABEL (any value)
kubectl label nodes -l FILTER_LABEL LABEL_NAME=VALUE
# Any nodes with the value 'FILTER_VALUE' for the label FILTER_LABEL
kubectl label nodes -l FILTER_LABEL=FILTER_VALUE LABEL_NAME=VALUE
This section briefly shows some options for deploying with Helm. However, users may find the specific documentation within a Helm chart to be helpful.
The command below will install the Helm chart `CHS_CHART` with the release name `MY_RELEASE` using the chart's default configuration:
helm install MY_RELEASE CHS_CHART
You can overwrite the defaults on the command line with the `--set` flag:
helm install MY_RELEASE CHS_CHART \
--set health_check.param_0=true \
--set health_check.env.setting_0="value"
You can also specify a file, such as `my_values.yaml`, to provide configuration values for your installation:
helm install MY_RELEASE CHS_CHART \
-f my_values.yaml
Note on deploying without Helm
Some users may find it useful for their workflow to not use Helm to deploy but still work with an equivalent YAML file.
For details on how to generate a YAML file from Helm, see the 'Generating YAML from Helm' section below.
After deploying the CHS Health Runner, you may want to observe relevant information about the cluster's nodes.
The `watch` command can be paired with other commands to display live updates of relevant information.
For example, the following command will display jobs that match 'healthcheck' every 10 seconds while highlighting the differences from each update:
watch -n10 -d "kubectl get jobs | grep healthcheck"
Since CHS uses node labels to update information about health checks for that node, it can be useful to print information about those nodes.
We already saw how to get some information on nodes in the 'Identifying labels' section:
kubectl describe nodes
kubectl get nodes
We can go further by filtering for nodes with a particular label `LABEL` and, optionally, a specific value `VALUE` for that label:
# Get nodes that have the 'LABEL' label (any value)
kubectl get nodes -l LABEL
# Get nodes that have the value 'VALUE' for the label 'LABEL'
kubectl get nodes -l LABEL=VALUE
We can also customize what `kubectl get nodes` prints by specifying the data to display with the flag `-o custom-columns=$CUSTOM_COLS`, where `CUSTOM_COLS` is a string that specifies how and what data to display.
For example, we can define `CUSTOM_COLS` as below, where `TEST_LABEL` is the label indicating whether the node should be tested and `RESULT_LABEL` is the label holding the health check's result:
CUSTOM_COLS="NODE:.metadata.name,TEST:.metadata.labels.TEST_LABEL,RESULT:.metadata.labels.RESULT_LABEL"
kubectl get nodes -o custom-columns=$CUSTOM_COLS
The output will look something like this:
NODE TEST RESULT
gke-cluter-name-np-0-dcs1a6c6-24rm true pass
gke-cluter-name-np-0-dcs1a6c6-7qd5 true <none>
gke-cluter-name-np-0-dcs1a6c6-81aw <none> <none>
gke-cluter-name-np-0-dcs1a6c6-8lc9 false fail
gke-cluter-name-np-0-dcs1a6c6-j3q0 true fail
gke-cluter-name-np-0-dcs1a6c6-tzl6 false pass
gke-cluter-name-np-0-dcs1a6c6-wkd3 <none> fail
gke-cluter-name-np-0-dcs1a6c6-z0mn <none> true
Note that the `<none>` values mean that the node does not have that label.
This also does not filter nodes and will print the specified custom columns for all nodes. To narrow down the nodes, you can pass other flags such as `-l` to `kubectl get nodes`, like below:
kubectl get nodes -l RESULT_LABEL='fail' -o custom-columns=$CUSTOM_COLS
This will produce something like the following (using the previous example results):
NODE TEST RESULT
gke-cluter-name-np-0-dcs1a6c6-8lc9 false fail
gke-cluter-name-np-0-dcs1a6c6-j3q0 true fail
gke-cluter-name-np-0-dcs1a6c6-wkd3 <none> fail
After deploying and running CHS, users might desire to clean up their installation.
To uninstall a Helm release, use the release name `RELEASE_NAME` in the following command:
helm uninstall RELEASE_NAME
Some users may have opted to deploy with `kubectl` instead of using the Helm chart directly. In that case, it's important to uninstall using `kubectl delete` with the same YAML file that was used with `kubectl apply`:
kubectl delete -f my_deployment.yaml
Using a Helm chart simplifies cleanup, but it's important to remove any lingering Jobs in your cluster that don't get cleaned up automatically.
You can list these with a command using `grep`:
kubectl get jobs | grep healthcheck
To remove lingering Jobs:
kubectl delete jobs JOB_NAME_0 JOB_NAME_1
Because Jobs from CHS tend to have similar names, you can filter those Jobs by name (like `healthcheck` in this example):
# Gets list of jobs, filters for `healthcheck`, selects only the Job name
kubectl get jobs \
| grep healthcheck \
| cut -d ' ' -f1
After confirming that the listed Jobs are the ones to delete, you can pipe that filtered list to `kubectl delete jobs`:
kubectl get jobs \
| grep healthcheck \
| cut -d ' ' -f1 \
| xargs kubectl delete jobs
Although deployment with Helm is the recommended method for CHS, users may find it useful for their workflow to produce an equivalent YAML file.
To generate the file, you can run the same `helm` command with the same configuration options as you would for installation, but replace `helm install` with `helm template` and redirect the standard output to a file such as `my_deployment.yaml`:
helm template MY_RELEASE CHS_CHART \
-f my_values.yaml \
--set health_check.param_0=true \
--set health_check.env.setting_0="value" \
> my_deployment.yaml
A user can then install with `kubectl` using the generated YAML file, with something like the following:
kubectl apply -f my_deployment.yaml