The Cluster Health Scanner (CHS) is a tool that checks the health of a GPU cluster. It runs various tests to ensure the cluster is ready for training workloads, specifically:
- NCCL check: Validates the network communication between GPUs using the NCCL library.
- GPU check: Utilizes NVIDIA's DCGM tool to diagnose the health of individual GPUs.
- Neper check: Leverages the Neper tool to assess network performance within the cluster.
- Straggler detection: Runs a network traffic pattern between nodes that closely resembles the pattern encountered during pipeline parallelism in LLM training workloads.
- Tinymax check: Uses the MaxText open-source LLM framework to assess ML training within the cluster.
CHS serves two main purposes:
- Proactive health checks: Ensures the cluster is in optimal condition for running training workloads.
- Debugging tool: Helps diagnose issues when you encounter problems with a training workload.
GPU Cluster availability: A3 and A3+
Orchestrator support: GKE and Slurm
The Cluster Health Scanner tool, or simply CHS, runs a series of tests called health checks to analyze the health of a cluster of GPU nodes.
For instructions on how to run CHS, go directly to the 'Running CHS' section.
While currently structured for Google Kubernetes Engine (GKE), CHS can in principle run on clusters using other Kubernetes orchestration implementations. We have also enabled CHS to run on additional cluster orchestrators, such as Slurm for HPC.
A tool for diagnosing cluster issues.
The cluster_diag tool is a helpful wrapper around the CHS diagnostic tool for the Accelerator-optimized machine family (currently only a3-highgpu-8g and a3-megagpu-8g). It is exposed via the healthscan command and aims to provide a single-line command that can run CHS, with no prior knowledge of CHS implementation details needed.
NOTE: The cluster_diag tool aims to provide a joyful experience for running Cluster Health Scanner; however, it may not support all use cases. To run CHS directly, see the instructions in the developer guide.
NOTE: cluster_diag expects that you have already authenticated the gcloud CLI to access Google Cloud with your Google user credentials. Additionally, gcloud should already have credentials for the cluster under test.
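For example, this can typically be done with the standard gcloud commands; the cluster name, region, and project below are placeholders:

# Authenticate gcloud with your Google user credentials
gcloud auth login

# Fetch kubeconfig credentials for the GKE cluster under test
gcloud container clusters get-credentials CLUSTER_NAME \
  --region REGION \
  --project PROJECT_ID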
- Clone this repository.
- If you don't already have them, install dependencies for the CLI:
  pip3 install click
  pip3 install kubernetes
- From the root dir of this repository, run:
  python3 cli/cluster_diag.py
  NOTE: cluster_diag currently only works from the root dir of this repo. See the Usage section for more details.
- (Optional) Use an alias to simplify usage and store common flags. For example, if you only use clusters orchestrated by GKE, you can use:
  alias cluster_diag="python3 cli/cluster_diag.py -o gke"
$ cluster_diag
Usage: cluster_diag [OPTIONS] COMMAND [ARGS]...
A tool for diagnosing cluster issues.
Options:
-o, --orchestrator [gke|slurm] Cluster orchestrator type. [required]
--version Show the version and exit.
--help Show this message and exit.
Commands:
healthscan Run a healthscan on a cluster.
Runs a CHS healthscan on a cluster.
$ cluster_diag -o gke healthscan a3-megagpu-8g --help
Usage: cluster_diag healthscan [OPTIONS] {a3-highgpu-8g | a3-megagpu-8g | a3-ultragpu-8g}
Run a healthscan on a cluster.
Options:
-c, --check [status|nccl|gpu|straggler|neper|tinymax]
Check to run. Available checks:
- status: (Default) Checks the current healthscan status of the cluster.
- nccl: Runs a pairwise NCCL bandwidth test on the cluster.
- gpu: Runs a GPU check on the cluster.
- straggler: Instruments a straggler check on the cluster.
- neper: Runs a Network Performance eval on the cluster.
- tinymax: Runs a LLM small training workload on the cluster.
-n, --nodes TEXT Nodes to run checks on. Defaults to running
on all nodes. When using slurm, a shortened
node format can be used. For example,
"node-[0-1]"
--run_only_on_available_nodes Force running the healthcheck only on
available nodes. Unavailable nodes will be
skipped.
--dry_run Run the healthcheck in dry run mode. This
will print the commands that would be run,
but not run them.
--help Show this message and exit.
Action | Command to Run |
---|---|
Get GKE cluster status | $ cluster_diag -o gke healthscan a3-megagpu-8g -c status |
Running a DCGM/GPU check | $ cluster_diag -o gke healthscan a3-megagpu-8g -c gpu |
Running a DCGM/GPU check only on available nodes | $ cluster_diag -o gke healthscan a3-megagpu-8g -c gpu --run_only_on_available_nodes |
Running a DCGM/GPU check on two Slurm Nodes | $ cluster_diag -o slurm healthscan a3-megagpu-8g -c gpu -n node-[0-1] |
Dry run of a DCGM/GPU check | $ cluster_diag -o slurm healthscan a3-megagpu-8g -c gpu -n node-[0-1] --dry_run |
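The same pattern applies to the other checks. For example, a pairwise NCCL bandwidth test on a GKE cluster would be:

$ cluster_diag -o gke healthscan a3-megagpu-8g -c nccl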
WARNING: Running CHS directly as described in this section is not preferred, and could result in unintended behavior. Please read this entire section before continuing.
Running CHS only requires installing the Health Runner on the cluster.
The Health Runner will be able to launch health checks on the cluster based on the user's installation configuration.
Note: Currently this is done on GKE/Kubernetes using Helm charts. The description below focuses on running CHS using Helm charts on GKE.
Nodes to be included in a health check are marked using a corresponding node label. The node label keys depend on the health check, with an expected value of "true":
- NCCL Health Check: aiinfra/nccl-healthcheck-test
- GPU Health Check: aiinfra/gpu-healthcheck-test
- Neper Health Check: aiinfra/neper-healthcheck-test
- Tinymax Health Check: aiinfra/tinymax-healthcheck-test
These label keys and values can be set with the kubectl tool using the following command:
kubectl label nodes \
--all \
aiinfra/nccl-healthcheck-test="true"
Note: This sets all nodes to be labeled for the NCCL health check.
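To label only a subset of nodes, list the node names explicitly instead of using --all, and verify the labels afterwards; the node names below are placeholders:

kubectl label nodes node-0 node-1 \
  aiinfra/gpu-healthcheck-test="true"

# Show which nodes carry the NCCL and GPU health check labels
kubectl get nodes -L aiinfra/nccl-healthcheck-test -L aiinfra/gpu-healthcheck-test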
To help Google Cloud engineers diagnose and resolve any potential issues with your cluster, you can optionally configure CHS to send its logs to Google. This allows our engineers to access only the logs from CHS and not the logs from the rest of the cluster.
This can be configured by the following steps:
- Create a Service Account: In the Google Cloud project where your cluster is running, create a new service account specifically for sending CHS logs to Google.
- Contact Google Support: Reach out to Google Cloud Support and provide them with the name of the service account you created. They will grant the necessary permissions for this service account to write logs to Google Cloud Logging.
- Create a Service Account Key: Generate a JSON key file for the service account. This key file will be used by CHS to authenticate with Google Cloud Logging.
- Create a Kubernetes Secret: Use the following command to create a Kubernetes secret containing the service account key:
  kubectl create secret generic fluentbit-key --from-file=key.json=key.json
By following these steps, you can enable log forwarding of CHS logs to Google and help our engineers provide you with the best possible support for your cluster.
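For reference, the service account, key file, and secret steps might look like the following; chs-log-forwarder is an illustrative service account name and PROJECT_ID is a placeholder:

# Create a dedicated service account for CHS log forwarding
gcloud iam service-accounts create chs-log-forwarder --project=PROJECT_ID

# Generate a JSON key file for the service account
gcloud iam service-accounts keys create key.json \
  --iam-account=chs-log-forwarder@PROJECT_ID.iam.gserviceaccount.com

# Store the key in a Kubernetes secret for CHS to use
kubectl create secret generic fluentbit-key --from-file=key.json=key.json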
The user can configure the Health Runner via the command line or as part of a YAML configuration file. This configuration also gives the settings for the health checks to be run.
Refer to the 'Default Configuration' section for an example of a full configuration file.
The following are the Health Runner configuration options:
health_runner.name: this will be used as the name of the Kubernetes Job for the Health Runner. For example:
health_runner:
name: "health-runner"
Each health check is listed under the health_checks section. Configuration is specific to each health check, though some settings apply to all health checks.
Note that in the sections below we use the placeholder HC_NAME, which would be replaced with the identifying name of a health check, such as nccl_healthcheck.
- health_checks.HC_NAME.run_check: this is either true or false and controls whether the health check is run.
- health_checks.HC_NAME.runner_name: the value for runner_name will be used as the base of the name of the Kubernetes Job for each health check instance launched.
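For example, which health checks run can be toggled at install time with helm --set rather than editing a values file; the release name below is illustrative, and the chart path follows the 'Running CHS' section later in this document:

helm install my-hr-release deploy/helm/health_runner \
  --set health_checks.nccl_healthcheck.run_check=false \
  --set health_checks.gpu_healthcheck.run_check=true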
This section specifies information regarding the Docker image for the health check.
- health_checks.HC_NAME.image.repo: the base repo URL for the Docker image for the health check.
- health_checks.HC_NAME.image.tag: the image tag for the Docker image for the health check.
- health_checks.HC_NAME.image.pull_policy: the pull policy for the Docker image for the health check.
Example:
health_checks:
HC_NAME:
...
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "subset"
pull_policy: "Always"
...
The blast_mode section of the configuration gives settings for running health checks in parallel.
- health_checks.HC_NAME.blast_mode.blast_mode_enabled: set to "true" or "false". If set to "false", a failed health check will taint the corresponding node(s).
- health_checks.HC_NAME.blast_mode.BLAST_MODE_NUM_TESTS_LIMIT: set to an integer specifying how many health checks can be launched simultaneously across the cluster.
- health_checks.HC_NAME.blast_mode.NODES_CHECKED_PER_TEST: set to an integer specifying how many nodes are run for each test. The NCCL and Neper health checks use 2 nodes, while the GPU and Tinymax health checks use only 1.
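As a sketch, the blast mode settings can also be overridden at install time; the paths below follow the structure of the default configuration shown later, and the values are illustrative:

helm install my-hr-release deploy/helm/health_runner \
  --set health_checks.nccl_healthcheck.blast_mode.blast_mode_enabled=true \
  --set-string health_checks.nccl_healthcheck.blast_mode.env.BLAST_MODE_NUM_TESTS_LIMIT="10" \
  --set-string health_checks.nccl_healthcheck.blast_mode.env.NODES_CHECKED_PER_TEST="2"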
The env section of the configuration is specific to each health check and is used to modify the settings for the health check(s) to be kicked off by the Health Runner. Some settings are specific to the health check type, but others are universal to all health checks.
- health_checks.HC_NAME.env.DRY_RUN: set to "true" or "false". If set to "false", a health check that fails on a node or nodes will taint the respective node(s).
- health_checks.HC_NAME.env.SLEEP_TIME_MINUTES: set to an integer value; acts as a timeout for the health check, specifying the maximum time allowed for completion. If a health check exceeds this time, it is canceled and the test result is not updated.
- health_checks.HC_NAME.env.YAML_FILE: specifies the YAML file used by the Health Runner to launch the health check. This YAML file must be present in the Health Runner container (via the Docker image).
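For instance, the universal env settings can be overridden in a small values file and passed to helm install -f; the file name and values below are illustrative, and the keys follow the default configuration shown later:

cat > my-env-overrides.yaml <<'EOF'
health_checks:
  nccl_healthcheck:
    env:
      DRY_RUN: "false"          # taint nodes that fail the check
      SLEEP_TIME_MINUTES: "15"  # cancel the check if it runs longer than this
EOF
helm install my-hr-release deploy/helm/health_runner -f my-env-overrides.yaml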
For the NCCL health check:
- health_checks.HC_NAME.env.YAML_FILE: must be set to "a3ultra/nccl_healthcheck.yaml", "a3plus/nccl_healthcheck.yaml", or "a3/nccl_healthcheck.yaml", depending on the nodes' accelerator type.
For the GPU health check:
- health_checks.HC_NAME.env.YAML_FILE: must be set to "gpu_healthcheck.yaml".
- health_checks.HC_NAME.env.R_LEVEL: set to 1, 2, 3, or 4, defining what level of diagnostics to run. Lower numbers indicate faster but more basic diagnostics. It is recommended to set this to 2 or 3, with 3 being a longer, more extensive diagnostic check.
For the Neper health check:
- health_checks.HC_NAME.env.YAML_FILE: must be set to "neper_healthcheck.yaml".
For the Tinymax health check:
- health_checks.HC_NAME.env.YAML_FILE: must be set to "tinymax_healthcheck.yaml".
The default configuration is set so that the Health Runner will run only the NCCL health check every 5 minutes (10 health checks at a time) for A3+ GPU nodes.
The default configuration for the Health Runner (found in the Helm chart values.yaml file) is shown below:
health_runner:
base_name: "chs-hr"
health_checks:
nccl_healthcheck:
run_check: true
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "4.2-latest"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "4.2-latest"
MACHINE_TYPE: "a3-megagpu-8g"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
FILTER_LABEL_NAME: "aiinfra/nccl-healthcheck-test"
FILTER_LABEL_VALUE: "true"
HELM_CHART: "/app/health_checks/nccl_healthcheck" # Path to Helm chart in container
HELM_INSTALL_FLAGS: "-f /app/health_checks/nccl_healthcheck/a3plus.yaml --set health_check.image.tag=${MACHINE_TYPE}_${HC_IMAGE_TAG}" # Specific to A3+
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
HEALTH_APP: "nccl"
PAIRING_MODE: "random"
SECOND_PASS_ENABLED: "true"
# Blast Mode
blast_mode:
blast_mode_enabled: true
env:
# BLAST_MODE_NUM_TESTS_LIMIT: "200" # Number of health checks to run in parallel
NODES_CHECKED_PER_TEST: "2"
gpu_healthcheck:
run_check: false
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "4.2-latest"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "4.2-latest"
MACHINE_TYPE: "a3-megagpu-8g"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
HELM_CHART: "/app/health_checks/gpu_healthcheck" # Path to Helm chart in container
HELM_INSTALL_FLAGS: "--set health_check.image.tag=${MACHINE_TYPE}_${HC_IMAGE_TAG}"
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
HC_ENV_R_LEVEL: 3
# Blast Mode
blast_mode:
blast_mode_enabled: true # Defaults to run multiple health checks in parallel
env:
# BLAST_MODE_NUM_TESTS_LIMIT: "200" # Number of health checks to run in parallel
NODES_CHECKED_PER_TEST: "1"
neper_healthcheck:
run_check: false
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "4.2-latest"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "4.2-latest"
MACHINE_TYPE: "a3-megagpu-8g"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
HELM_CHART: "/app/health_checks/neper_healthcheck" # Path to Helm chart in container
HELM_INSTALL_FLAGS: "--set health_check.image.tag=${MACHINE_TYPE}_${HC_IMAGE_TAG}"
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
blast_mode:
blast_mode_enabled: true # Defaults to run multiple health checks in parallel
env:
# BLAST_MODE_NUM_TESTS_LIMIT: "200" # Number of health checks to run in parallel
NODES_CHECKED_PER_TEST: "2"
straggler_healthcheck:
run_check: false
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "4.2-latest"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "4.2-latest"
MACHINE_TYPE: "a3-megagpu-8g"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
HELM_CHART: "/app/health_checks/straggler_healthcheck" # Path to Helm chart in container
HELM_INSTALL_FLAGS: "--set health_check.image.tag=${MACHINE_TYPE}_${HC_IMAGE_TAG}"
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
HOSTS_CSV: nil # Allow health runner to identify the nodes
N_NODES: nil # Default to run on all nodes in the cluster
GCS_BUCKET_NAME: "straggler-healthcheck-logs"
blast_mode:
blast_mode_enabled: false # Defaults to run multiple health checks in parallel
tinymax_healthcheck:
run_check: false
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "4.2-latest"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "4.2-latest"
MACHINE_TYPE: "a3-megagpu-8g"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "10"
FILTER_LABEL_NAME: "aiinfra/tinymax-healthcheck-test"
FILTER_LABEL_VALUE: "true"
HELM_CHART: "/app/health_checks/tinymax_healthcheck" # Path to Helm chart in container
HELM_INSTALL_FLAGS: "-f /app/health_checks/tinymax_healthcheck/a3plus.yaml --set health_check.image.tag=${MACHINE_TYPE}_${HC_IMAGE_TAG}" # Specific to A3+
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
# Blast Mode
blast_mode:
blast_mode_enabled: true
env:
# BLAST_MODE_NUM_TESTS_LIMIT: "200" # Number of health checks to run in parallel
NODES_CHECKED_PER_TEST: "1"
nccl_cluster_healthcheck:
run_check: false
image:
repo: "us-docker.pkg.dev/gce-ai-infra/health-check/health-runner"
tag: "4.2-latest"
pull_policy: "Always"
env:
HC_IMAGE_TAG: "4.2-latest"
MACHINE_TYPE: "a3-megagpu-8g"
DRY_RUN: "true"
SLEEP_TIME_MINUTES: "30"
FILTER_LABEL_NAME: "aiinfra/nccl-healthcheck-test"
FILTER_LABEL_VALUE: "true"
HELM_CHART: "/app/health_checks/nccl_healthcheck" # Path to Helm chart in container
HELM_INSTALL_FLAGS: "-f /app/health_checks/nccl_healthcheck/a3plus.yaml --set health_check.image.tag=${MACHINE_TYPE}_${HC_IMAGE_TAG}" # Specific to A3+
ACCELERATOR_TYPE: "nvidia-h100-mega-80gb"
SECOND_PASS_ENABLED: "true"
HC_ENV_NHOSTS: "4"
# Blast Mode
blast_mode:
blast_mode_enabled: true
env:
NODES_CHECKED_PER_TEST: "4"
To start, clone the repository:
git clone https://github.com/GoogleCloudPlatform/cluster-health-scanner.git
cd cluster-health-scanner/
Running CHS involves installing the Health Runner. On a Kubernetes orchestration such as GKE, this is done by deploying the Health Runner Helm chart.
The Health Runner Helm chart can be used to install the release using the helm command shown below:
MY_HEALTH_RUNNER_RELEASE_NAME="my-hr-release"
helm install "${MY_HEALTH_RUNNER_RELEASE_NAME}" \
deploy/helm/health_runner
This will install the Health Runner with the default configuration which will kick off the health checks automatically to be run on the nodes in the cluster.
You can also specify your own configuration using your own value files:
MY_HEALTH_RUNNER_RELEASE_NAME="my-hr-release-custom-config"
MY_CONFIG="./my-config.yaml"
helm install "${MY_HEALTH_RUNNER_RELEASE_NAME}" \
deploy/helm/health_runner \
-f "${MY_CONFIG}"
You can also set specific configuration values on the command line using the helm install --set parameter. For example, the following command launches only the GPU health check on the nodes, using R_LEVEL: "1" instead of the default values.
MY_HEALTH_RUNNER_RELEASE_NAME="my-hr-release-gpu-only"
helm install "${MY_HEALTH_RUNNER_RELEASE_NAME}" \
deploy/helm/health_runner \
--set health_checks.nccl_healthcheck.run_check=false \
--set health_checks.gpu_healthcheck.run_check=true \
--set health_checks.gpu_healthcheck.R_LEVEL="1"
As the Health Runner launches health checks on nodes and they complete, users can view the health check results.
Health check results are stored as node labels and can be viewed using the
Kubernetes kubectl
tool.
The following command displays results for the NCCL health check for each node:
CUSTOM_COLS="NODE:.metadata.name,MARK:.metadata.labels.aiinfra/nccl-healthcheck-test,BANDWIDTH:.metadata.labels.aiinfra/nccl-healthcheck-bandwidth,RESULT:.metadata.labels.aiinfra/nccl-healthcheck-result,RUNTIME:.metadata.labels.aiinfra/nccl-healthcheck-runtime-sec"
kubectl get nodes -o custom-columns="${CUSTOM_COLS}"
This outputs a table with columns showing each node's name and the values of its health check labels.
If the watch command is installed, you can create a dynamic display for live updates:
watch -n 10 -d "kubectl get nodes -o custom-columns=${CUSTOM_COLS}"
watch reruns the table display command every 10 seconds, highlighting any changes.
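Because custom-columns prints <none> for labels a node does not have, you can also filter the table down to nodes that already have complete results, for example:

# Show only nodes where every requested health check label is present
kubectl get nodes -o custom-columns="${CUSTOM_COLS}" | grep -v '<none>'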
After deploying and running CHS, users should ensure that the installation is fully cleaned up. This will prevent any potential issues of lingering configurations, jobs, or other resources.
To uninstall the Health Runner (a Helm release), use the release name
(MY_HEALTH_RUNNER_RELEASE_NAME
) in the following command:
helm uninstall "${MY_HEALTH_RUNNER_RELEASE_NAME}"
While the Health Runner Helm chart simplifies cleanup, it's important to remove any lingering Jobs in the cluster that are not removed automatically.
You can list these with a command like the following:
kubectl get jobs | grep "chs-hc-"
To remove lingering Jobs:
kubectl delete jobs $JOB_NAME_0 $JOB_NAME_1
Because Jobs from CHS tend to have similar names, you can filter those jobs by name (the chs-hc- prefix in this example) with something like the following:
# Gets the list of Jobs, filters for the chs-hc- prefix, and selects only the Job name
kubectl get jobs \
| grep "chs-hc-" \
| cut -d ' ' -f1
After confirming the jobs listed are the ones to delete, you can use the command below to delete those jobs:
kubectl get jobs --no-headers \
| grep "chs-hc-" \
| cut -d ' ' -f1 \
| xargs kubectl delete jobs