[gpu] work on continuous integration testing #1205
Conversation
gpu/install_gpu_driver.sh
* replace add_nonfree_components with add_contrib_components
* perform the apt-get update outside of the update function
* using OS comparison functions instead of string comparisons (see the sketch after this list)
* small grammar fix

gpu/test_gpu.py
* only testing 11.8 and 12.4 CUDA variants
* removed conditional test skips
* using g2 machine type and GPU_L4 when testing MIG
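For context on the OS-comparison item, a minimal sketch of the pattern; the helper names below (os_id, os_version, is_debian, is_rocky, is_debian10) are illustrative assumptions rather than the exact functions in install_gpu_driver.sh:

```bash
# Hypothetical helpers -- names are assumptions for illustration only.
os_id()      { grep -E '^ID='         /etc/os-release | cut -d= -f2 | xargs ; }
os_version() { grep -E '^VERSION_ID=' /etc/os-release | cut -d= -f2 | xargs ; }

is_debian()   { [[ "$(os_id)" == 'debian' ]] ; }
is_rocky()    { [[ "$(os_id)" == 'rocky'  ]] ; }
is_debian10() { is_debian && [[ "$(os_version)" == 10* ]] ; }

# Callers branch on a named predicate instead of repeating raw string tests:
if is_debian10 ; then
  echo "pin apt sources to oldoldstable"
fi
```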
Manually tested: ✅ 2.2-rocky9 with secure boot DISabled, CUDA=11.8
* redirecting to file descriptor 2 instead of the file named 2 (see the sketch below)
* removing nvidia module if it is installed
* adding safety checks around the use of nvidia-smi
* using appropriate MOK file paths
* removed uname -r suffix from kernel-devel package name for rocky
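To illustrate the redirect fix and the nvidia-smi safety check, a rough sketch; the echoed messages and the query flags are placeholders, not lines from the actual script:

```bash
# Bug: "> 2" writes stdout to a file literally named "2" in the working directory.
echo "ERROR: driver build failed" > 2

# Fix: ">&2" redirects to file descriptor 2 (stderr) instead of creating a file.
echo "ERROR: driver build failed" >&2

# Safety check: only call nvidia-smi when it is actually installed.
if command -v nvidia-smi >/dev/null 2>&1 ; then
  nvidia-smi --query-gpu=name --format=csv,noheader || echo "no GPU visible to nvidia-smi" >&2
fi
```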
/gcbrun

It looks like 2.2-ubuntu22 is failing to build the 520 driver

/gcbrun
/gcbrun

/gcbrun
modified: gpu/install_gpu_driver.sh
* added is_debian10
* corrected codename; it should be oldoldstable
* using is_$distname() functions a bunch

function install_nvidia_cudnn
* installing libcudnn9 on rocky9

function install_nvidia_gpu_driver
* redirect output of module build to log files

function hadoop_properties
* using /usr/local/bin/bdconfig for the bdconfig command so this script works under sudo -i on a repro cluster (see the sketch after this list)

modified: gpu/test_gpu.py
* skipping all tests for now to ease debugging with CI infrastructure
* removed logic to skip rocky builds
* reduced disk size from 200G to 50G since we're using pd-ssd for the CI project now, and we want to be careful with that expensive resource
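A quick illustration of the sudo -i issue behind the bdconfig change; the configuration file, property name, and value below are placeholders, and the only detail taken from the commit is the absolute path:

```bash
# Under "sudo -i", /usr/local/bin may be missing from PATH, so a bare "bdconfig"
# invocation can fail even though the tool is installed. Using the absolute path
# keeps the script working on a repro cluster. Property/value are made up here.
/usr/local/bin/bdconfig set_property \
  --configuration_file '/etc/spark/conf/spark-defaults.conf' \
  --name 'spark.example.setting' \
  --value 'example-value' \
  --clobber
```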
modified: gpu/install_gpu_driver.sh
* cleaned up variable names
* broke out nccl shortname from normal shortname
* specifying versions when installing driver and cudatools on rocky (sketched below)
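As a rough sketch of what pinned installs on rocky can look like (the module stream and package names here follow common NVIDIA repo conventions and are assumptions, not necessarily the exact strings the script uses):

```bash
# Illustrative only -- the real versions come from the script's version variables.
dnf -y module install 'nvidia-driver:550-dkms'   # pinned driver branch via DKMS stream
dnf -y install 'cuda-toolkit-12-4'               # pinned CUDA toolkit version
```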
/gcbrun

/gcbrun

/gcbrun
Nice. GPU tests are passing again. Kind of. :-)

/gcbrun
added some whitespace to spark arguments to clarify what is happening
/gcbrun
Next round of manual tests: ✅ 2.0-debian10 with CUDA=11.8,12.4

/gcbrun
…ether to run install_drivers_aliases into the function itself
…ackages for rocky
/gcbrun
    if not FLAGS.skip_cleanup:
-       args.append("--max-age=2h")
+       args.append("--max-age=30m")
How long do the GPU tests take to finish? I thought the cluster creation itself takes a lot of time, around 30m.
The startup script and init action take about 10-16 minutes from the gcloud command being issued to the cluster reaching the active state. But I'm either merging changes to files outside of gpu/ into Prince's changes or discarding them. I'm discarding this one, but I would keep it in gpu/test_gpu.py if I could.
Oh, I see what you're saying. The tests don't take long to run; the delays come mostly from gcloud compute ssh. Once we turn the MIG tests back on, we need to account for nvidia-smi being super slow on systems with 8 attached H100s. The commands are simple and should not take more than 30s each to execute.
With the longest measured duration of fetching, building, installing, and signing the kernel modules plus installing about 10GB of binary support libraries being about 16 minutes, and 30s for each of about 10 tests, we're at roughly 21 minutes. We could try reducing it to 25m and see if all tests still pass, but I think leaving five minutes of headroom for congested networks or retries will give us a more reliable pass rate.
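Spelling out that budget arithmetic (the numbers are the rough figures from this thread, not measured constants):

```bash
# ~16 min of driver/library install + ~10 tests x ~30 s each
echo $(( 16 + (10 * 30) / 60 ))   # => 21  (minutes of expected work)
# so --max-age=30m keeps several minutes of headroom above that estimate
```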
Here are stats from the most recent run:
Step #2 - "dataproc-2.0-debian10-tests": 2024-08-02T19:27:53.262772599Z //gpu:test_gpu PASSED in 2313.8s
Step #3 - "dataproc-2.0-rocky8-tests": 2024-08-02T18:58:06.475604281Z //gpu:test_gpu PASSED in 376.8s
Step #5 - "dataproc-2.1-debian11-tests": 2024-08-02T19:29:09.794418578Z //gpu:test_gpu PASSED in 2401.5s
Step #6 - "dataproc-2.1-rocky8-tests": 2024-08-02T18:55:34.482352277Z //gpu:test_gpu PASSED in 268.7s
Step #7 - "dataproc-2.1-ubuntu20-tests": 2024-08-02T19:28:33.297038722Z //gpu:test_gpu PASSED in 2473.7s
Step #8 - "dataproc-2.2-debian12-tests": 2024-08-02T19:19:00.625695199Z //gpu:test_gpu PASSED in 1917.9s
Step #9 - "dataproc-2.2-rocky9-tests": 2024-08-02T18:55:36.599559802Z //gpu:test_gpu PASSED in 264.5s
/gcbrun
…ion_tests/ and cloudbuild/ into prince's branch before squash/merge
gpu/README.md
gpu/install_gpu_driver.sh
gpu/test_gpu.py
cloudbuild/presubmit.sh
integration_tests/dataproc_test_case.py