[gpu] work on continuous integration testing #1205
Conversation
gpu/install_gpu_driver.sh
* replace add_nonfree_components with add_contrib_components
* perform the apt-get update outside of the update function
* using OS comparison functions instead of string comparisons (see the sketch after this list)
* small grammar fix

gpu/test_gpu.py
* only testing 11.8 and 12.4 CUDA variants
* removed conditional test skips
* using g2 machine type and GPU_L4 when testing MIG
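For context on the OS-comparison item, a minimal sketch of the pattern; the helper names below (os_id, os_version, is_debian, is_rocky, is_debian10) are illustrative assumptions rather than the exact functions in install_gpu_driver.sh:

```bash
# Hypothetical helpers -- names are assumptions for illustration only.
os_id()      { grep -E '^ID='         /etc/os-release | cut -d= -f2 | xargs ; }
os_version() { grep -E '^VERSION_ID=' /etc/os-release | cut -d= -f2 | xargs ; }

is_debian()   { [[ "$(os_id)" == 'debian' ]] ; }
is_rocky()    { [[ "$(os_id)" == 'rocky'  ]] ; }
is_debian10() { is_debian && [[ "$(os_version)" == 10* ]] ; }

# Callers branch on a named predicate instead of repeating raw string tests:
if is_debian10 ; then
  echo "pin apt sources to oldoldstable"
fi
```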
Manually tested: ✅ 2.2-rocky9 with secure boot DISabled, CUDA=11.8
* redirecting to file descriptor 2 instead of the file named 2 (see the sketch below)
* removing nvidia module if it is installed
* adding safety checks around the use of nvidia-smi
* using appropriate MOK file paths
* removed uname -r suffix from kernel-devel package name for rocky
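To illustrate the redirect fix and the nvidia-smi safety check, a rough sketch; the echoed messages and the query flags are placeholders, not lines from the actual script:

```bash
# Bug: "> 2" writes stdout to a file literally named "2" in the working directory.
echo "ERROR: driver build failed" > 2

# Fix: ">&2" redirects to file descriptor 2 (stderr) instead of creating a file.
echo "ERROR: driver build failed" >&2

# Safety check: only call nvidia-smi when it is actually installed.
if command -v nvidia-smi >/dev/null 2>&1 ; then
  nvidia-smi --query-gpu=name --format=csv,noheader || echo "no GPU visible to nvidia-smi" >&2
fi
```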
/gcbrun

It looks like 2.2-ubuntu22 is failing to build the 520 driver

/gcbrun
/gcbrun

/gcbrun
modified: gpu/install_gpu_driver.sh
* added is_debian10
* corrected codename; it should be oldoldstable
* using is_$distname() functions a bunch

function install_nvidia_cudnn
* installing libcudnn9 on rocky9

function install_nvidia_gpu_driver
* redirect output of module build to log files

function hadoop_properties
* using /usr/local/bin/bdconfig for the bdconfig command so this script works under sudo -i on a repro cluster (see the sketch after this list)

modified: gpu/test_gpu.py
* skipping all tests for now to ease debugging with CI infrastructure
* removed logic to skip rocky builds
* reduced disk size from 200G to 50G since we're using pd-ssd for the CI project now, and we want to be careful with that expensive resource
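A quick illustration of the sudo -i issue behind the bdconfig change; the configuration file, property name, and value below are placeholders, and the only detail taken from the commit is the absolute path:

```bash
# Under "sudo -i", /usr/local/bin may be missing from PATH, so a bare "bdconfig"
# invocation can fail even though the tool is installed. Using the absolute path
# keeps the script working on a repro cluster. Property/value are made up here.
/usr/local/bin/bdconfig set_property \
  --configuration_file '/etc/spark/conf/spark-defaults.conf' \
  --name 'spark.example.setting' \
  --value 'example-value' \
  --clobber
```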
modified: gpu/install_gpu_driver.sh
* cleaned up variable names
* broke out nccl shortname from normal shortname
* specifying versions when installing driver and cudatools on rocky (sketched below)
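As a rough sketch of what pinned installs on rocky can look like (the module stream and package names here follow common NVIDIA repo conventions and are assumptions, not necessarily the exact strings the script uses):

```bash
# Illustrative only -- the real versions come from the script's version variables.
dnf -y module install 'nvidia-driver:550-dkms'   # pinned driver branch via DKMS stream
dnf -y install 'cuda-toolkit-12-4'               # pinned CUDA toolkit version
```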
/gcbrun

/gcbrun

/gcbrun
Nice. GPU tests are passing again. Kind of. :-)

/gcbrun
added some whitespace to spark arguments to clarify what is happening
/gcbrun
Next round of manual tests: ✅ 2.0-debian10 with CUDA=11.8,12.4

/gcbrun
…ether to run install_drivers_aliases into the function itself
…ackages for rocky
/gcbrun
    if not FLAGS.skip_cleanup:
-       args.append("--max-age=2h")
+       args.append("--max-age=30m")
How long do the GPU tests take to finish? I thought the cluster creation itself takes a lot of time, around 30m.
The startup script and init action take about 10-16 minutes from the gcloud command being issued to the cluster reaching the active state. But I'm either merging changes to files outside of gpu/ into Prince's changes or discarding them. I'm discarding this one, but I would keep it in gpu/test_gpu.py if I could.
Oh, I see what you're saying. The tests don't take long to run; the delays come mostly from gcloud compute ssh. Once we turn the MIG tests back on, we need to account for nvidia-smi being super slow on systems with 8 attached H100s. The commands are simple and should not take more than 30s each to execute.
With the longest measured duration of fetching, building, installing, and signing the kernel modules plus installing about 10GB of binary support libraries being about 16 minutes, and 30s for each of about 10 tests, we're at roughly 21 minutes. We could try reducing it to 25m and see if all tests still pass, but I think leaving five minutes of headroom for congested networks or retries will give us a more reliable pass rate.
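Spelling out that budget arithmetic (the numbers are the rough figures from this thread, not measured constants):

```bash
# ~16 min of driver/library install + ~10 tests x ~30 s each
echo $(( 16 + (10 * 30) / 60 ))   # => 21  (minutes of expected work)
# so --max-age=30m keeps several minutes of headroom above that estimate
```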
Here are stats from the most recent run:
Step #2 - "dataproc-2.0-debian10-tests": 2024-08-02T19:27:53.262772599Z //gpu:test_gpu PASSED in 2313.8s
Step #3 - "dataproc-2.0-rocky8-tests": 2024-08-02T18:58:06.475604281Z //gpu:test_gpu PASSED in 376.8s
Step #5 - "dataproc-2.1-debian11-tests": 2024-08-02T19:29:09.794418578Z //gpu:test_gpu PASSED in 2401.5s
Step #6 - "dataproc-2.1-rocky8-tests": 2024-08-02T18:55:34.482352277Z //gpu:test_gpu PASSED in 268.7s
Step #7 - "dataproc-2.1-ubuntu20-tests": 2024-08-02T19:28:33.297038722Z //gpu:test_gpu PASSED in 2473.7s
Step #8 - "dataproc-2.2-debian12-tests": 2024-08-02T19:19:00.625695199Z //gpu:test_gpu PASSED in 1917.9s
Step #9 - "dataproc-2.2-rocky9-tests": 2024-08-02T18:55:36.599559802Z //gpu:test_gpu PASSED in 264.5s
/gcbrun
…ion_tests/ and cloudbuild/ into prince's branch before squash/merge
gpu/README.md
gpu/install_gpu_driver.sh
gpu/test_gpu.py
cloudbuild/presubmit.sh
integration_tests/dataproc_test_case.py